ARCHITECTURAL CONSIDERATION OF TOPS-DSP FOR VIDEO PROCESSING. Takao Nishitani. Tokyo Metropolitan University

nishitani@eei.metro-u.ac.jp

ABSTRACT

A possible DSP chip architecture with Tera-Operations-Per-Second (TOPS) processing capability is considered for future mobile applications, including advanced HDTV processing. The approach employed here is to minimize power consumption while retaining software programmability. The resultant architecture is a kind of programmable systolic architecture, but it has multiple segmented buses. Although such a DSP implementation currently reaches only 0.12 TOPS, the architecture is the most promising one for future mobile DSPs with TOPS capability.

1. INTRODUCTION

Recent high-end cell phones already have multimedia functions based on advanced video, image and audio processing. This trend is further accelerated by the availability of low-cost, high-resolution cameras and displays. A compact terminal with such low-cost devices will surely become a personal advisor tool in the future, letting us know something important around us. For example, an HDTV-resolution camera on a cell phone will check crowded streets to find the owner's acquaintances. However, the amount of video and audio processing required to find useful information around us through high-resolution cameras is also exploding. The LSI processors to be used should have advanced processing capability far beyond that of today's PCs. In addition, a cell phone's mobility depends heavily on low power dissipation in order to keep battery life long. Furthermore, programmability is mandatory for realizing many complex functions on the same chip by activating software programs stored in an internal ROM.

The architecture presented here is one of the multicore processors which realize enhanced processing capability with relatively low power dissipation. The target processing capability is set to one TOPS (Tera-Operations-Per-Second) in the future, with a maximum power dissipation of less than 500 mW. This specification on power dissipation comes from the power of an analog TV receiver LSI once used in an actual cell phone.

The architectural consideration here is based on an observation of LSI technology. When up-to-date LSI process technology is applied to one of the first-generation DSP chips, which appeared about a quarter century ago, around 1000 DSP processors are estimated to fit on a single LSI chip. This number is larger than the number of NTSC video pixels in a horizontal line. When element processors are arranged in a linear array, the area assigned to every element processor becomes less than one pixel in a line. Some video processors [1][2] based on large-grain parallelism have employed independent element processors in a linear array, but they can no longer be used because of inefficient spatial processing in their assigned areas.

The architecture presented here employs fine-grain parallelism, based on our earlier reconfigurable multi-DSP system [3]. However, the resultant architecture can be considered a simplified version of iWarp [4] without message-passing communication for data blocks. As the approach employed is to use the very slow clock speed of early-stage DSPs and short communication paths, broadcasting over a small number of element processors in different areas can be established as well as pipeline communication. Such a broadcasting function is referred to as segmented bus communication hereafter. The combination of pipeline connection and bus connection is shown to be useful for realizing complex functions such as block-matching motion detection. Other applications of segmented buses are filtering and transformation. Some considerations on a different architecture are also briefly included.

2. NUMBER OF PROCESSORS IN A CHIP

A long time has passed since the introduction of single-chip programmable DSPs, which are now commonly used in cell phones. First-generation DSP chips of the 1980s had real-time processing capability for speech and audio signals. For example, the µPD77C25, introduced in the mid 1980s by NEC, had 60 MOPS (Mega Operations Per Second) processing capability with a 10 MHz clock frequency.

Parameter            Data
Word length          16 bits
Instruction ROM      2048 words (24 bits)
Data ROM             1024 words
Data RAM             256 words
GOPS                 0.06
Power dissipation    25 mW

Fig. 1. 256 processors within a chip with 25 mW of power: (a) typical DSP parameters in the 1980s; (b) 256 processors in a chip with 25 mW of power.

Fig. 2. Mesh array can support short communication paths: (a) mesh array; (b) DSP with gateway (gateway DSP with halved processor resources). The clock frequency is still set to 10 MHz, and a small capacity increase leads to a very low power multiprocessor of less than 50 mW.

The chip dissipated 25 mW of power with a 16 x 16 bit multiplier, a 2K-word instruction ROM and a 256-word data RAM, as shown in Fig. 1(a). The up-to-date design rule is now 0.09 µm to 0.06 µm. Therefore, if this processor were redesigned with today's technology, many 77C25 DSPs could be implemented on the same die area. Let us evaluate such a multi-DSP chip in terms of processing capability and power dissipation. The evaluation is based on the following LSI device issues.

I) CMOS-LSI power consumption P is estimated by

$$P = C f V^2 \qquad (1)$$

where C is the wiring capacitance in a chip, f is the clock frequency and V is the power supply voltage.

II) In every generation of LSI process technology, twice as many transistors can be integrated in the same die area.

Eight generations of LSI process technology have passed since the µPD77C25. Therefore, 256 processors fit within the same die area, as shown in Fig. 1(b). The power supply voltage at that time was 5 V, but this voltage was chosen for the users' convenience in connecting the chip to peripheral TTL chips. Nowadays, a one-volt power supply is commonly used. Therefore, when the clock speed is kept at 10 MHz and the supply voltage is set to one volt, these 256 processors in a chip consume only 25 mW of power. In addition, die areas in recent high-end chips are around 4 to 8 times larger than those of LSIs for the consumer market in the 1980s. Therefore, 1024 to 2048 processors in a chip are reasonable, and they will consume only 100 mW to 200 mW of power.
As the µPD77C25 can process 6 operations simultaneously by VLIW instructions, the processing capability reaches 60 GOPS to 0.12 TOPS. Although a TOPS realization in a chip may be possible by using a higher clock frequency, such as 200 MHz, it would result in 2 W of power. Therefore, the clock speed is set to 10 MHz hereafter. As our goal is a power consumption of less than 500 mW, the estimated power dissipation satisfies it with ample margin. In terms of power dissipation, the only assumption made here concerns the leakage current of transistors. However, such problems are expected to be solved soon by LSI device technologies such as silicon on insulator.

3. LOW POWER INTERCONNECTION

Although 1024 to 2048 DSPs can be implemented on a chip, the interconnection among them will cause high power consumption. The interconnection paths should be neither fast nor long, since these increase the frequency and the wiring capacitance in Eq. (1), respectively. Introducing a 10 MHz common bus covering all the processors would greatly increase wiring capacitance and result in very high power consumption. The only possible approach that does not increase power consumption much is to connect only neighboring processors, as shown in Fig. 2(a). In order to do so, the element processor is first divided into two parts, a DSP and an I/O gateway, as shown in Fig. 2(b) [3]. As all the DSP portions and gateway portions work at 10 MHz, the architecture becomes a systolic array when every pixel in a picture is assigned to an element processor in an arrangement similar to the array. Data transfer among neighboring processors is realized word by word. Processing on a pixel basis, such as background prediction [5][6], is carried out by the DSP portions alone, but the collaboration between the gateway parts and the DSP parts makes them a systolic array.
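The power and throughput estimates of Section 2 can be checked with a few lines of arithmetic. The following Python sketch assumes, as the text does, a fixed 1 V supply and power that scales linearly with the number of processors and with the clock frequency, anchored at the 25 mW quoted for 256 processors at 10 MHz. The function and constant names are illustrative only and do not come from the original design.

```python
# Back-of-the-envelope model of the multi-DSP chip estimates in Section 2.
# Assumption: power grows linearly with processor count and clock frequency
# (P = C f V^2 with C proportional to die area), at a fixed 1 V supply.

REF_PROCESSORS = 256      # processors in the reference die area
REF_POWER_W = 0.025       # 25 mW for 256 processors at 10 MHz and 1 V
REF_CLOCK_HZ = 10e6       # 10 MHz clock
OPS_PER_CYCLE = 6         # operations per VLIW instruction


def voltage_scaling(v_old, v_new):
    """Power ratio obtained from P = C f V^2 when only V changes."""
    return (v_new / v_old) ** 2


def chip_estimate(n_processors, clock_hz=REF_CLOCK_HZ):
    """Return (power in watts, operations per second) for the chip."""
    power_w = (REF_POWER_W
               * (n_processors / REF_PROCESSORS)
               * (clock_hz / REF_CLOCK_HZ))
    ops_per_second = n_processors * OPS_PER_CYCLE * clock_hz
    return power_w, ops_per_second


if __name__ == "__main__":
    print("5 V -> 1 V power ratio:", voltage_scaling(5.0, 1.0))
    for n, f in [(1024, 10e6), (2048, 10e6), (1024, 200e6)]:
        p, ops = chip_estimate(n, f)
        print(f"{n} processors at {f / 1e6:.0f} MHz: "
              f"{p * 1e3:.0f} mW, {ops / 1e9:.1f} GOPS")
```

Under these assumptions, 1024 and 2048 processors at 10 MHz give roughly 100 mW and 200 mW with about 61 and 123 GOPS, and raising the clock of 1024 processors to 200 MHz gives about 2 W, in line with the figures quoted in the text.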

Fig. 3. By-path allows segmented buses used within 4 to 8 processors: (a) multiple segmented buses over blocks of K processors, with delay time within 100 nsec; (b) gateway with by-path switches (By-Path-1, By-Path-2) attached to the DSP.

Thanks to the systolic array approach, this array processor is free from the Von Neumann bottleneck of memory bandwidth: once data is fetched from the main memory, processing is carried out consecutively on this data. In terms of processor resource efficiency, the SIMD approach generally supports a wide range of complex but low-level signal processing. However, the architecture uses independent element processors. This is because a long instruction bus among processors causes high power consumption due to its large wiring capacitance. For such low-level signal processing, the same resident program is prepared on every processor.

The approach employed is a modified version of the systolic approach, which establishes segmented buses that broadcast over a small number of element processors. The segmented bus is configured by data bypass switching functions in the gateway part of every element processor, as shown in Fig. 3(b). By controlling these switches, multiple element processor zones can be established in a chip, as shown in Fig. 3(a). In addition, since the by-path switch control is carried out by software in the gateway part of the element processor, the formation of element processor zones can be reconfigured through the collaboration of multiple element processors. However, this modification surely increases power dissipation due to the bypass line capacitance. In order to limit this increase, the segmented bus length is restricted so that the worst propagation delay in every segmented bus is within 100 nanoseconds, which is equal to the 10 MHz clock period. In addition, when a bus structure is not required, the by-path connection is switched off by software, and the additional power consumption for the by-path becomes negligible. The number of element processors connected to a segmented bus is now set to 4 or 8.

Fig. 4. Segmented bus supports cutting idle steps (axes: time and pixel position): (a) pipeline connection, information delivery by pipeline, idle steps max 2K; (b) segmented bus introduction, information delivery by segmented bus, idle steps max K.

4. Effectiveness of Segmented Bus

There are two reasons to employ segmented buses. The first reason is that the array processor should be a programmable one. Because of the pipeline timing, a task is carried out in a series of element processors, and this works well as long as the series of processing tasks is in the form of a tandem connection, as shown in step 1 to step 3 in Fig. 4(a). Information gathering is also possible in this form, starting from the left-most processor and ending at the right-most processor. After collecting the information, the right-most element processor can make a decision. Such a decision has to be sent back to all the element processors in a zone. After that, the program sequence can be modified by the decision. This means that a systolic connection in a programmable approach results in a long delayed jump, shown as Max 2K in Fig. 4(a). However, if a segmented bus is available, as shown in Fig. 4(b), a decision made in the right-most processor is delivered directly to all element processors in one step. When an image processing algorithm employs a block size larger than a segmented bus zone, tandem transfer of the decision is still required, but the segmented bus accelerates information delivery. Therefore, it results in a short delayed jump.

Note that the idle steps are not so harmful in a cell phone LSI, because idle DSPs in element processors can be put to sleep for a while, which lowers power. Introducing the segmented bus is thus a trade-off between the number of available dynamic steps and low power implementation. The architecture can be adapted to this trade-off by software.
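The difference between the two delivery schemes can be illustrated with a small step-count model. The sketch below is a simplification under stated assumptions, not the paper's exact timing: a decision made at the right-most processor of a zone either travels back hop by hop through the pipeline, or is broadcast once per bus segment, with segments relayed in tandem. The function names and the one-step-per-segment relay rule are hypothetical.

```python
import math

# Simplified step-count model for returning a decision to every element
# processor in a zone. Assumptions (not from the paper): one hop or one
# bus broadcast costs one step, and bus segments are relayed in tandem.


def pipeline_delivery_steps(zone_size):
    """Hop-by-hop return from the right-most to the left-most processor."""
    return zone_size - 1


def segmented_bus_delivery_steps(zone_size, bus_size):
    """One broadcast per bus segment, segments relayed one after another."""
    return math.ceil(zone_size / bus_size)


if __name__ == "__main__":
    for zone in (8, 16, 32):
        for bus in (4, 8):
            print(f"zone={zone:2d} bus={bus}: "
                  f"pipeline {pipeline_delivery_steps(zone):2d} steps, "
                  f"segmented bus {segmented_bus_delivery_steps(zone, bus)} steps")
```

When the zone fits within one segment the decision arrives in a single step. For larger zones the bus still shortens the delayed jump, at the cost of the extra bypass capacitance discussed in Section 3.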

Fig. 5. Segmented bus eliminates skewed delay. At time n, a pixel in the reference picture is put on the bus. Each of the K element processors holds a pixel in the current block and produces its absolute difference just in time when the pipelined accumulation of $\sum_{i=1}^{K} |a_i - b_{i+n}|$ reaches it.

The second reason is that some systolic algorithms do not always exclude the use of a bus, thanks to the expansion of the systolic architecture in actual ASIC implementations. The proposed system makes use of such algorithms. Let us examine this situation for full-search, block-matching motion detection. This function is composed of three parts: an absolute difference calculation between pixels in the block to be evaluated and pixels in the reference picture, an accumulation of these differences to form the L1 norm, and a minimum L1 norm decision among the L1 distances calculated over a predetermined search area in the reference picture. The block position corresponding to the best-match L1 norm gives the motion to be detected.

The approach shown in Fig. 5 efficiently realizes the above three functions by delivering the input over the bus to element processors in a pipeline, where every processor calculates an absolute difference and adds it to the interim L1 norm. The latter function is carried out in a pipelined manner. The following describes this processing. The absolute difference operations between the block pixels $\{a_i\}$ and the reference pixels from the j-th position to the (j+K)-th position for the j-th L1 norm are

$$|a_i - b_{i+j}|, \quad i = 1, 2, \ldots, K.$$

The interim accumulation at the i-th processor is carried out by adding the i-th absolute value to the interim accumulation result sent from the preceding processor. When the j-th norm has been completed at the right-most processor, that processor compares the j-th norm with the minimum L1 norm selected from the norms already calculated.
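The one-dimensional scheme described above can be sketched as a small simulation. In the Python sketch below, one reference pixel is broadcast per time step, every element processor adds its absolute difference to the partial sum handed over by its left neighbor, and the right-most processor keeps the running minimum. This is a minimal illustration under simplifying assumptions: it uses 0-based indices, ignores the time sharing of the single inter-processor channel, and the function name is hypothetical.

```python
# Simulation of 1-D full-search block matching on a linear array with a
# broadcast bus. Processor i holds block[i]; at time t the reference pixel
# reference[t] is on the bus, so processor i works on offset j = t - i.


def systolic_sad_search(block, reference):
    """Return (best_offset, best_l1_norm) over all valid offsets."""
    k = len(block)
    n_offsets = len(reference) - k + 1
    partial = [None] * k   # partial[i]: (offset, sum) produced by processor i
    best = None
    for t, b in enumerate(reference):          # b is broadcast on the bus
        new_partial = [None] * k
        for i in range(k):
            j = t - i                          # offset handled by processor i
            if 0 <= j < n_offsets:
                # Interim sum arriving from the preceding processor.
                incoming = 0 if i == 0 else partial[i - 1][1]
                new_partial[i] = (j, incoming + abs(block[i] - b))
        partial = new_partial
        done = partial[k - 1]                  # norm completed at the right end
        if done is not None and (best is None or done[1] < best[1]):
            best = done
    return best


if __name__ == "__main__":
    print(systolic_sad_search([1, 2, 3], [9, 9, 1, 2, 3, 9]))
```

Because each reference pixel reaches all processors at the same time step, the partial sum for offset j meets exactly the pixel it needs at processor i, so no skewed input delay lines are required.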
Note that the connection paths between element processors are all single channels. Therefore, time sharing is required for the absolute difference calculation and the interim norm accumulation. As a result, a block of length K, compared with blocks of the same length in a search window of plus/minus p pixels, can be processed in 2(K + p) steps. Note that during these operation steps, K norms are calculated simultaneously.

The above explanation concerns the one-dimensional case. Two-dimensional motion detection is just a simple expansion of this approach. However, the actual segmented bus does not support a direct implementation, because only 4 to 8 processors can be connected to the same segmented bus. Pipeline processing with tandem segmented bus transmission requires 2 to 4 additional steps. In any case, the segmented bus contributes to efficient motion detection processing.

Fig. 6. Fast algorithm introduction requires high speed buses: (a) fast algorithm based architecture; (b) systolic-like architecture with bus functions.

In addition, fast matrix-vector multiplication is also efficiently realized by segmented buses. Consider a matrix-vector multiplication where an N x N matrix $A = [a_{ij}]$ and an N-element vector $X = [x_j]$ produce an N-element product vector $[b_i]$. When an element of vector X, for example $x_1$, appears on the segmented bus, all element processors in a zone can start the calculation of their i-th products ($a_{i1} x_1$) in the N corresponding processors simultaneously. The accumulation for every $b_i$ is accomplished within each processor. Therefore, only N steps are required to complete the product calculation.

Although segmented buses support efficient processing of matrix-vector multiplication or transformation, fast algorithms are not applicable in this architecture. However, the proposed architecture has many attractive features. Figure 6 shows a processor based on the FFT algorithm and the vector transformation by our architecture. Consider that one-dimensional vector data is fed into the first linear array processor in Fig. 6(a). The first array processors send processed data to the second array through shuffling functions, which are implemented by a set of switches and long data lines. Switching functions and long lines increase the active area and the wiring capacitance. These facts lead to high power consumption, although these functions are not arithmetic operations and are therefore not counted in the complexity of the fast algorithm. In addition, shuffling units are required at every linear array output. Furthermore, all the processed data is stored in bit-reversed order, although one-dimensional processing can be completed in $\log_2 N$ arrays. A corner-turner unit has to be prepared if two-dimensional processing is required.

On the contrary, in the proposed architecture, every element processor holds a DFT frequency component in frequency order after N steps, as shown in Fig. 6(b). No bit-reversal ordering is required. In addition, when two-dimensional processing is to be carried out after one-dimensional processing, every element processor sets up vertical segmented buses and puts the one-dimensional DFT spectrum on the segmented buses in order. The processing can be carried out in a manner similar to the one-dimensional horizontal processing. No additional hardware is required.

Fig. 7. System example for SIF picture processing: a single TOPS DSP (4 lines of DSPs with 60 banks of 4-word memory) connected through a multiplexed memory interface to a frame memory of 240 lines x 352 words.

5. PROCESSOR ARRAY CONFIGURATION

As each frame interval of video pictures is 33 milliseconds and the clock frequency is set to 10 MHz, 330,000 plane steps are available during this interval if the number of element processors is the same as the number of pixels. Let us consider that NM pixels exist in a picture and only V element processors are within the chip. Then, every element processor has to be used MN/V times in a time-shared manner. At the moment, around 2,000 processors are available, but this will soon become more than 8,000 after two more LSI process generations.
The introduction of a 20 MHz bus and 8000 element processors will achieve one TOPS. In order to maintain the same architecture for future expansion, a simple scalability scheme for the multi-processor architecture is considered in Fig. 7, where SIF picture processing of 352 x 240 pixels is used. The system is composed of an array processor having 352 x 4 element processors and an external memory. Sharing the array 60 times within the frame interval achieves whole-picture processing. Every DSP holds processed data and exchanges pixels with the external memory as I/O operations. As the internal RAM in a DSP is 256 words, 60-times sharing for one frame period results in 4 words in every shared area. Every time 4 lines have been processed in the array, every DSP exchanges one of its 60 internal memory banks of 4 words. In terms of available processing steps, each block of 4 x 352 pixels enjoys 5,500 steps. As the array processor has only 1400 element processors, around 75 mW of power is estimated.

In order to evaluate the capability of this system, let us consider full-search motion detection within the range of [-8, +8] in both horizontal and vertical directions. The block size is set to 16x16. As the block is 4 times larger than the element processor area, 22,000 steps are available for real-time processing. The approach shown in Fig. 5 asks 2(6+6) steps for 6 norms in vertical direction and 6 different reference points have to be set for 6x352 areas. Therefore, the required steps are 2(6+6)x6=736. As 22,000 steps are available in real-time processing, the full-search motion detection can be realized within 4% of the total processing cycles. This cycle estimate is a little higher than the theoretical amount of processing because of the pipeline delays. A careful pipeline design will reach the level of the theoretical result.

Although the architectural consideration starts from programmable low power HDTV processing in the future, the actual capability of the multi-processor right now is estimated to be around 0.06 to 0.1 TOPS, and the efficiency of the segmented bus can already be shown in the 4-line linear array arrangement. Therefore, the SIF picture evaluation has been performed. In terms of HDTV processing, the architecture will be applicable to simple processing by designing a very simple DSP part in order to increase the number of element processors. Therefore, the chip now being designed [7] is a breadboard one, having a programmable network with simple arithmetic functions. However, the simplification of the DSP part greatly depends on the applications because of the limitation of the internal memory capacity.

Note that reconfigurable approaches, such as FPGA-based design, are now gathering a lot of attention. However, careful placement of functional blocks is required to obtain high performance. Many software engineers do not like to carry out such an optimization process whenever they modify software.

6. Application

As the direct application of the architecture to an HDTV processor requires two more generations, our candidate application for a simplified processor concerns another project on the re-broadcasting of digital terrestrial TV programs through a Ka band multi-beam internet satellite. This project aims to eliminate all uncovered areas, such as villages on high mountains, in valleys or on islands, in order to terminate analog broadcasting in the year 2011 in Japan. The problem there is that the 23 Mbps of terrestrial TV broadcasting information, including 14 Mbps MPEG-2 HDTV, should be compressed to 10 Mbps for economic reasons. Therefore, 14 Mbps MPEG-2 HDTV has to be compressed to 5 Mbps. The H.264 encoding algorithm is said to achieve the same video quality as MPEG-2 at half the bit rate. However, H.264 on a software encoder is currently reported to require around 10 Mbps for broadcasting purposes. The criterion employed is that the DSCQS degradation is within 2 % over 2/3 of the contents. Moreover, the quality is expected to be similar to that of HDTV programs from HD-DVD or Blu-ray Disc, because people in the uncovered areas can enjoy these media. In addition, most people will enjoy watching TV programs on a large-screen flat panel TV in the future, from only 1 meter away or so. Artifacts and resolution drops are very easy to recognize in such conditions.

In my project, the same quality as 14 Mbps MPEG-2 broadcasting is realizable for 70% of a day by sharing common TV programs from the major network broadcasting companies. The remaining 30% is for original programs by the broadcasting companies in rural areas. Therefore, such programs have to be compressed to 5 Mbps. At such a bit rate, even H.264 produces a lot of blocky artifacts and color shifts. For that time period in our project, an approach similar to multiple description coding is going to be applied, where only even fields are prepared in the TV program part (many people say that this resolution is called half D1).
Only for the advertisement part, full frame rate HDTV is planned in order to support sponsors and realize free-of-charge TV broadcasting. The data broadcasting part is replaced by the odd field information only during the advertisement time period. Although this system is simple, it is working well. No severe opinions are obtained for pictures having large motions or for panning pictures. However, small motions or slow panning scenes generate artifacts which are easy to recognize. The first application area of the chip may be these areas, because there is enough time until the final decision of the year 2011, and the processor will be programmable. The consideration will cover the conversion of 15 frames per second to 30 frames per second, in order to cover wide applications on IP networks as well as the elimination of uncovered areas through satellite re-broadcasting. However, such conversion results in a lot of awkwardness in pictures. Some conversion schemes using object extraction are also being examined to produce good HDTV pictures, but even if the object is clearly extracted, the motion of full frame pictures is awkward and hard to watch at the moment.

7. CONCLUSION

A TOPS-DSP architecture for image and video processing has been studied for future mobile applications. As the chip surely contains many element processors for low power realization, a systolic-like array processor approach with segmented buses has been proposed. The use of pipeline and parallel processing has been shown to be effective for a future TOPS-DSP. Although most fast algorithms are not effective, the proposed processor is shown to have many attractive features. Full-search motion estimation has been shown to occupy only a small percentage of the dynamic steps.
REFERENCES

[1] Ichiro Tamitani, Hidenobu Harasaki and Takao Nishitani, "A Real-Time HDTV Signal Processor: HD-VSP," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 1, No. 1, pp. 35-41, 1991.
[2] S. Kyo, T. Koga, S. Okazaki and I. Kuroda, "A 51.2GOPS Programmable Video Recognition Processor for Vision based Intelligent Cruise Control Applications," IEICE Trans. on Information and Systems, Vol. E87-D, No. 1, pp. 136-145, Jan. 2004.
[3] Jacob Levison, Ichiro Kuroda and Takao Nishitani, "A Reconfigurable Processor Array with Routing LSIs and General Purpose DSP," Conference Proc. IEEE ISAP 92, pp. 102-116, 1992.
[4] James Sutton and Paul Wiley, "iWarp: A 100-MOPS, LIW Microprocessor for Multi-computers," IEEE Micro Mag., pp. 26-87, June 1991.
[5] K. Toyama, J. Krumm, et al., "Wallflower: Principles and Practice of Background Maintenance," Conference Proc. ICCV, pp. 255-261, 1999.
[6] Chris Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," Conference Proc. CVPR, Vol. 2, pp. 246-252, 1999.
[7] Hiroshi Suzuki, Takao Nishitani and Hachiro Fujita, "Study on TOPS-DSP Architecture," IEICE Monthly Research Meeting on Circuits and Systems, Jan. 2007.