A video signal processor for motion-compensated field-rate upconversion in consumer television

B. De Loore, P. Lippens, P. Eeckhout, H. Huijgen, A. Löning, B. McSweeney, M. Verstraelen, B. Pham, G. de Haan, J. Kettenis

Philips Semiconductors, Philips Research, Philips Consumer Electronics

Author return address:
B. De Loore
Philips GmbH, RHW
Stresemannallee 101
D-22529 Hamburg
Germany
Phone: +49 40 5613 3691
Fax: +49 40 5613 3392
Email: deloore@hhcich01.serigate.philips.nl

24 October 1995

1.0 Abstract

The four embedded video signal processors on this IC provide a processing power of 10 Gops. Their architecture was generated from an algorithm description using behavioural synthesis. The required 25 Gbit/s memory bandwidth was realized by embedding 24 single/dual port SRAM/DRAM instances. The test approach includes full scan, boundary scan, functional test, built-in self-test and IDDq test.

2.0 Paper summary

In today's 100Hz television sets, the display rate is doubled by displaying each incoming field twice. As a result, moving objects are displayed at an incorrect position in the interpolated fields (Fig. 1). In the new generation of 100Hz television sets, this artefact is removed by calculating the motion vectors for all objects and performing motion-compensated interpolation. Known algorithms for motion estimation and compensation require a huge number of computations. In [1], the 3D-recursive block matching algorithm is presented, which makes a one-chip solution possible. The presented IC is also capable of judder-free motion portrayal of movie material (25Hz to 50Hz upconversion), noise reduction and vertical zoom.

Fig. 2 shows the chip architecture and application. The IC consists of three subprocessors and one top-level processor. Motion estimation is performed by two recursive block matchers, using only 8 candidate vectors per block. The search spaces of the current and previous fields have to be accessed randomly at the pixel frequency, which can only be done using on-chip cache memory. Candidate vector evaluation and selection of the best vector are also performed in the motion estimation processor.

In the vector processor, the selected vectors are stored in a vector field memory, to be used for interpolation and as one of the candidates in the next field. The vector processor organizes the storage and access of the vector field. It also postprocesses the originally calculated motion vectors in order to obtain motion vectors for blocks of 4x2 pixels.

The interpolation processor generates the motion-compensated interpolated output video field. Here too, two video search spaces have to be accessed randomly. The vertical delays could be reused, but the horizontal caches had to be duplicated.
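
As an illustration of the candidate vector evaluation described above, the following minimal sketch selects the best of a small set of candidate vectors for one block. It is not the implementation on the IC: the sum of absolute differences (SAD) as match criterion, the 8x8 block size, the border clamping, the candidate list and the helper names (`sad`, `best_vector`, `make_field`) are all assumptions made for this example. The way the 3D-recursive algorithm derives its candidates is described in [1] and is not reproduced here.

```python
def clamp(value, low, high):
    """Limit value to the range [low, high]."""
    return max(low, min(high, value))


def sad(cur, prev, bx, by, vec, bw=8, bh=8):
    """Sum of absolute differences between the block at (bx, by) in the
    current field and the block displaced by -vec in the previous field.
    SAD as match criterion is an assumption of this sketch."""
    vx, vy = vec
    height, width = len(prev), len(prev[0])
    total = 0
    for y in range(by, by + bh):
        for x in range(bx, bx + bw):
            py = clamp(y - vy, 0, height - 1)  # clamp at field borders
            px = clamp(x - vx, 0, width - 1)
            total += abs(cur[y][x] - prev[py][px])
    return total


def best_vector(cur, prev, bx, by, candidates, bw=8, bh=8):
    """Evaluate every candidate vector and return the best match."""
    return min(candidates, key=lambda v: sad(cur, prev, bx, by, v, bw, bh))


def make_field(width, height, shift_x=0):
    """Hypothetical test pattern; shift_x moves the content to the right."""
    return [[(3 * (x - shift_x) ** 2 + 5 * y + (x - shift_x) * y) % 251
             for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    prev = make_field(32, 16)
    cur = make_field(32, 16, shift_x=2)  # content moved 2 pixels to the right
    # 8 candidates per block, as in the text; these values are illustrative.
    candidates = [(0, 0), (1, 0), (2, 0), (-1, 0),
                  (0, 1), (0, -1), (3, 1), (-2, 0)]
    print(best_vector(cur, prev, 16, 8, candidates))
```

Because only a handful of candidates is evaluated per block, the cost per block stays constant regardless of the size of the search area, which is what makes the random access to the cached search spaces the dominant requirement.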

The top-level processor synchronizes the data flow between the three subprocessors and peripheral functions such as noise reduction, formatters and I/O. Externally, a video field memory and a microprocessor are required.

Programmability was implemented to allow user-specific taste settings and picture sizes, and to allow fine tuning of those parts of the algorithm that could only be evaluated in real time. For this purpose, there are 28 on-chip 8-bit registers that can be read and/or written by the microprocessor.

An implementation of the algorithm on existing microprocessors or DSPs is not feasible with the current state of technology. The recursive block matching, which is only a small part of the algorithm, requires 14 operations per sample. With a sample frequency of 33 MHz, this results in a required computing power of 462 MOPS. The integrated function requires a total computing power of 10 GOPS. The required memory bandwidth can also only be offered by a dedicated processor: the chosen architecture requires a memory bandwidth of 25 Gbit/s.

Time to market was a major issue in the design of an application-specific processor for this algorithm. Register Transfer Level (RTL) design using logic synthesis does not offer a solution. The conversion of the behavioural specification into an RTL specification (Table 1) takes too much time: the designer has to take care of clock-level signal timing, memory organisation and addressing, and the generation of all control signals. Because of the complex nature of the algorithm (recursive, quincunx) and the more than 100 modes of operation, such an approach would not have been possible in the given time frame.
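
The computing-power figure quoted above for the recursive block matching follows from a simple multiplication. The short sketch below reproduces it and relates it to the 10 GOPS total; the function and variable names are chosen for this example only.

```python
OPS_PER_SAMPLE = 14            # recursive block matching, from the text
SAMPLE_RATE_HZ = 33_000_000    # 33 MHz sample frequency, from the text
TOTAL_MOPS = 10_000            # 10 GOPS for the integrated function


def required_mops(ops_per_sample, sample_rate_hz):
    """Computing power in millions of operations per second."""
    return ops_per_sample * sample_rate_hz / 1e6


block_matching_mops = required_mops(OPS_PER_SAMPLE, SAMPLE_RATE_HZ)
share_of_total = block_matching_mops / TOTAL_MOPS

if __name__ == "__main__":
    print(f"block matching: {block_matching_mops:.0f} MOPS")
    print(f"share of total: {share_of_total:.1%}")
```

Under these numbers the block matching accounts for less than 5% of the total computing power, which is consistent with the statement that it is only a small part of the algorithm.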

For this IC, the four video processors were generated using in-house behavioural synthesis tools from the Phideo toolset [2]. These tools take a behavioural specification and generate an architecture at the RT level. Arithmetic and logic operations are the functional primitives of the specification. A scheduler determines in which clock cycle each function is performed. From this time schedule, memories are inferred to store intermediate processing results, the memory organisation is determined and the addressing hardware is generated. A controller is then synthesized to control and synchronize the complete architecture. The designer selects architecture alternatives; the tools perform bookkeeping and detailed optimization tasks. The output of the behavioural synthesis toolbox is synthesizable RTL VHDL, which can be translated to a gate-level netlist using logic synthesis tools. Retiming at the gate level is performed to let the circuit run at the desired operating frequency of 33 MHz.

An overview of the embedded memories is given in Table 2. Four different memory types had to be used to arrive at an acceptable area/performance ratio. Dual-port 33 MHz random access is achieved using register-file generators. For single-port 33 MHz instances, an embedded-SRAM module generator is used. The vector field, characterized by single-port 16.5 MHz bandwidth, is implemented using an embedded-DRAM module generator. Area and power constraints imposed the use of DRAM technology for the 15 line delay instances. A dedicated line memory was designed with doubled data throughput at 33 MHz. This was realized by making simultaneous read and write access to the same memory address possible within one memory cycle. The typically incremental access of these line delays also made it possible to reduce their power dissipation by a factor of three, through the addition of a so-called page mode.
A single line memory instance stores 896 pixels of 8 bits, measures 1.1 mm² and dissipates 0.5 mW/MHz at 5 V. In total, 170 Kbit of data are stored in embedded memory.

A single-phase clocking scheme was adopted for the core of the IC. To limit clock skew, the clock distribution scheme of Fig. 3 was used. The clock generator was placed near the center of the IC. Its output is connected with equally long wires to four large clock buffers, placed in the center of the four sides of the IC's periphery. The outputs of these clock buffers are in turn connected (ck_core) in a ring that surrounds the core. The core circuitry connects to this clock ring locally. The clock generator also generates clock signals (ck_sync_in, ck_sync_out) with a small phase difference compared to the core clock signal. This was required to meet the timing requirements for the IC's inputs and outputs.

In the datapath, a full scan approach was adopted. The overall fault coverage, based on a stuck-at fault model, exceeds 99%. This test may leave bridging faults undetected. These faults can be identified by measuring the quiescent current (IDDq) in the absence of a clock signal for different vector sets. Care was taken in the design of I/O and memory cells to switch off all possible current leakage paths during this test. Timing problems remain undetected by scan test and IDDq test. A functional test that activates the timing-critical path is therefore included in the vector set.

Memories require special attention, as they are extremely dense. In addition to stuck-at and bridging faults, coupling faults and specific decoder faults also have to be identified. Specific test pattern sequences (6N, 9N algorithms) were used to test the SRAM and DRAM modules properly.
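
To illustrate what a 6N memory test pattern sequence looks like, the sketch below runs the MATS++ march test, a well-known algorithm that performs six operations per cell, on a small simulated memory. The text does not state which 6N and 9N algorithms were used on the IC, so the choice of MATS++ is an assumption, as are the `FaultyMemory` model with its injected stuck-at cells and the one-bit cell width.

```python
class FaultyMemory:
    """Simulated 1-bit-wide memory with optional stuck-at cells
    (hypothetical fault model for this example)."""

    def __init__(self, size, stuck_at=None):
        self.cells = [0] * size
        self.stuck_at = dict(stuck_at or {})  # address -> forced value

    def write(self, addr, value):
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def mats_plus_plus(mem, size):
    """MATS++ march test: {any(w0); up(r0, w1); down(r1, w0, r0)}.
    Returns the sorted list of failing addresses and the operation count."""
    failing = set()
    ops = 0
    # Element 1: write 0 to every cell (any address order).
    for addr in range(size):
        mem.write(addr, 0)
        ops += 1
    # Element 2: ascending addresses, read 0 then write 1.
    for addr in range(size):
        if mem.read(addr) != 0:
            failing.add(addr)
        mem.write(addr, 1)
        ops += 2
    # Element 3: descending addresses, read 1, write 0, read 0.
    for addr in reversed(range(size)):
        if mem.read(addr) != 1:
            failing.add(addr)
        mem.write(addr, 0)
        if mem.read(addr) != 0:
            failing.add(addr)
        ops += 3
    return sorted(failing), ops


if __name__ == "__main__":
    print(mats_plus_plus(FaultyMemory(16), 16))
    print(mats_plus_plus(FaultyMemory(16, stuck_at={5: 1}), 16))
```

The operation count grows linearly with the number of cells N (here 6N), which is why such sequences are practical for dense embedded memories.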

Finally, board-level testing is made possible by the implementation of a boundary-scan-like test concept.

The high quality standards must be met at the lowest possible cost. Test time is the major parameter determining the test cost. Test time was minimized by having 28 scan chains operate in parallel, with the lengths of these chains well balanced. When applied serially, the memory tests take too much time and the required vectors occupy too much of the tester's pin memory. Four built-in self-test modules have been implemented to avoid this. Each module consists of a test-pattern generator for both data and addresses, a signature analyser and a state machine to control the complete test sequence.

Table 3 shows the main characteristics of the realized chip. 65% of the transistors are in embedded memory. The IC specification was written and verified by a system designer. A team of three IC designers performed the synthesis. Functional samples were available one year after project start. Fig. 4 shows a chip photograph.
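
The signature analyser in the built-in self-test modules described above can be pictured as a register that compresses all memory read-out words into one short signature, which is then compared with a known-good value. The sketch below shows this principle with an 8-bit multiple-input signature register (MISR). The register width, the feedback polynomial 0x1D, the data background produced by `pattern_generator` and the `run_bist` helper are assumptions for illustration; the text does not specify how the on-chip modules are built.

```python
def misr_step(state, data, poly=0x1D, width=8):
    """Advance the signature register by one clock: shift left, apply the
    feedback polynomial if the top bit was set, then XOR in the data word."""
    mask = (1 << width) - 1
    feedback = (state >> (width - 1)) & 1
    state = (state << 1) & mask
    if feedback:
        state ^= poly
    return state ^ (data & mask)


def signature(words, poly=0x1D, width=8):
    """Compress a sequence of response words into a single signature."""
    state = 0
    for word in words:
        state = misr_step(state, word, poly, width)
    return state


def pattern_generator(num_words):
    """Hypothetical address-derived data background for the test."""
    return [(addr * 0x35 + 0x5A) & 0xFF for addr in range(num_words)]


def run_bist(read_word, num_words, golden):
    """Read back every address, compress the responses and compare the
    result with the known-good (golden) signature."""
    responses = [read_word(addr) for addr in range(num_words)]
    return signature(responses) == golden


if __name__ == "__main__":
    expected = pattern_generator(64)
    golden = signature(expected)
    print(run_bist(lambda addr: expected[addr], 64, golden))
    # Inject a single-bit error at address 10 in the read-out data.
    print(run_bist(lambda addr: expected[addr] ^ (0x04 if addr == 10 else 0),
                   64, golden))
```

Only the final signature has to be compared, so neither the test vectors nor the expected responses need to be stored in the tester's pin memory.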

3.0 References

[1] G. de Haan, J. Kettenis, B. De Loore, "IC for motion-compensated 100Hz TV with smooth-motion movie-mode", Digest of the ICCE 95, Chicago.
[2] P. E. R. Lippens, J. L. van Meerbergen, A. van der Werf, W. F. J. Verhaegh, B. T. McSweeney, J. O. Huisken, and O. P. McArdle, "PHIDEO: A silicon compiler for high speed algorithms", in Proc. EDAC, Amsterdam, The Netherlands, Feb. 1991, pp. 436-441.

[Fig. 1 plot: position versus time (field nr, 1 to 8). Legend: object in original fields, object in interpolated fields. Annotation: expected position of object in field 6.]
Fig. 1 : Field repetition causes moving objects to be perceived twice by the object tracking observer.

[Fig. 2 block diagram. Block labels: input video, external field memory, external microprocessor, µP interface, pixel line delays (current field), pixel line delays (previous field), horizontal caches, block matchers, vector line delays, best match selection, motion estimation processor, temporal vector prediction, field delay, vector postprocessing, vector processor, interpolator, interpolation processor, noise reduction, formatting, reformatting, colour processing, top level processor, output video.]
Fig. 2 : The IC architecture: one top-level processor organizes the data transfer between three processors and peripheral functions.

Table 1: Behavioural versus RT level spec

                     behavioural spec    RT level spec
abstraction level    algorithm           architecture
video image          3D-array            serial pixel stream
time base            video frame         clock cycle
language             C                   VHDL, Verilog
simulation time      1                   500
code size            1                   50

Table 2: Embedded memory instances

                      instances   R/W ports                   # words    circuit style
pixel line delay      15          2 ports, 8 bit, 33 MHz      896        FIFO-type DRAM
horizontal cache      6           2 ports, 40 bit, 33 MHz     15...30    2-port register file
vector field delay    1           1 port, 10 bit, 16.5 MHz    4096       1-port DRAM
vector line delay     2           1 port, 10 bit, 33 MHz      54         1-port SRAM
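
The memory bandwidth quoted in the paper can be cross-checked against the figures of Table 2. The sketch below sums the bandwidth of all instances under the assumption that every port transfers one word per cycle at the listed frequency; that assumption, and the names used in the code, are not taken from the paper.

```python
# (name, instances, ports, word width in bits, port frequency in MHz),
# taken from Table 2.
MEMORIES = [
    ("pixel line delay", 15, 2, 8, 33.0),
    ("horizontal cache", 6, 2, 40, 33.0),
    ("vector field delay", 1, 1, 10, 16.5),
    ("vector line delay", 2, 1, 10, 33.0),
]


def bandwidth_gbit_s(instances, ports, word_bits, freq_mhz):
    """Peak bandwidth in Gbit/s, assuming one word per port per cycle."""
    return instances * ports * word_bits * freq_mhz / 1000.0


total_instances = sum(m[1] for m in MEMORIES)
total_bandwidth = sum(bandwidth_gbit_s(*m[1:]) for m in MEMORIES)

if __name__ == "__main__":
    for name, *params in MEMORIES:
        print(f"{name:20s} {bandwidth_gbit_s(*params):7.3f} Gbit/s")
    print(f"{total_instances} instances, {total_bandwidth:.3f} Gbit/s in total")
```

Under this assumption the 24 instances add up to roughly 24.6 Gbit/s, in line with the rounded 25 Gbit/s stated in the text, with the horizontal caches contributing the largest share.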

[Fig. 3 diagram. Signal labels: ck_ext, clk generator, ck, ck_core, ck_sync_in, ck_sync_out.]
Fig. 3 : Clock distribution scheme

Table 3: IC characteristics

process               0.8 µm CMOS
dissipation           1.8 W
compute power         10 Gops
memory bandwidth      25 Gbit/s
die size              97 mm²
embedded memory       170 Kbit
transistor count      980,000
test vector length    260,000
max. clock rate       33 MHz
package               PLCC84

Fig. 4 : Chip photograph