An Architecture for MPEG Motion Estimation

Zandonai, Diogo; Carro, Luigi; Bampi, Sergio; Suzim, Altamiro
{diogo, carro, bampi,

Abstract

This paper presents an architecture developed to perform the motion estimation step of the MPEG algorithm. Among the tasks involved in MPEG, motion estimation is the one that requires the most computing power. The proposed architecture consists of parallel processors dedicated to computing the difference between two numbers that represent pixel luminance. Each processor computes the difference and accumulates it in a single clock cycle. The I/O structure is designed to keep the processors occupied and to reduce the memory bandwidth to just 8 I/O bits. The implementation was initially described in VHDL at the RTL level. The clock speed reached is about 8 MHz due to FPGA limitations. The architecture was designed so that it can be replicated and run in parallel until the necessary performance is reached.

1. Introduction

Many general-purpose RISC, CISC and DSP processors on the market today can perform most usual computational tasks, such as text editing, spreadsheets, filtering and others. These processors are implemented in silicon using standard-cell or full-custom layouts. These technologies are much faster than an FPGA implementation and reach clock frequencies from 200 MHz to 1 GHz. The processor market is extremely competitive in terms of physical technology, but there is no consensus about the ideal architecture.

A number of tasks require large computational power and cannot be performed by any of the processors cited above, even those running at 1 GHz. Motion estimation is one of these tasks, so we used a parallel, optimized architecture to compensate for the deficiencies in technology.

The problem addressed is the MPEG video compression algorithm [PEN93]. This algorithm consists of three essential blocks:

ME: Motion Estimation;
DCT: Discrete Cosine Transform;
VLC: Variable Length Coding.

Table 1 shows the computational power required to perform motion estimation for some usual standards.

Table 1 - Computational Power for ME

Standard          Blocks (8x8)  Blocks (16x16)  GOPS (8x8)  GOPS (16x16)
Video Conference  [...]         [...]           [...].95    7.31
NTSC              [...]         [...]           [...].42    50.37
HDTV              [...]         [...]           [...].73    276.77

The table makes clear that motion estimation is a critical task in the MPEG algorithm and needs on the order of 300 GOPS (giga operations per second) to run in real time. High-performance DSP processors can reach rates of up to 200 GOPS under ideal conditions, but the rates required for motion estimation are greater still, which justifies developing a dedicated architecture for it.

The paper first presents the motion estimation problem and the limitations that the algorithm imposes on the architecture.
Section 3 explains the architecture, memory organization and dataflow structure. Section 4 shows the double-memory strategy and details of the VHDL description of the Processing and Comparing units. The last section presents the results and conclusions, relating the obtained clock frequencies to physical aspects of the architecture.

2. Problem Definition

The MPEG standard leaves the encoder free to choose the size of the Search and Reference blocks. The encoder is also free to implement any algorithm to decide which block best matches the reference. We therefore define here how we decided to approach the problem.

We considered motion estimation applied to 8x8-pixel Reference blocks, searching over the displacement interval [-4,3] pixels. This results in a search window of 15x15 pixels. During block matching, that is, finding the block that best matches the Reference block, most pixels are used multiple times. We can therefore reuse data already read from memory to increase the computational throughput and to reduce the memory bandwidth needed for data input.

For each 15x15-pixel Search window there are 64 motion vector candidates. For each candidate, 64 partial distortions are computed, so a total of 64 x 64 = 4096 add/sub operations must be computed for each motion vector estimated. The architecture developed performs 8 distortions in parallel, which means 4096 / 8 = 512 processing cycles for each motion vector computed.

3. Proposed Architecture

The proposed architecture is inspired by [BHA99]. Figure 1 shows the proposed architecture, which uses an array of 8 processors working in parallel. Each element of the array is called a processing unit, or simply a processor. This array computes 8 distortion values in parallel; these 8 values are the distortions of candidate blocks displaced by one pixel position from each other. When the computation of the first 8 values is finished, the next line is taken and 8 new values are computed. Each motion vector candidate has 64 partial distortions associated with it. These partial distortions are accumulated to compose the total distortion.
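The search described so far can be summarized by a small software reference model. The sketch below is an illustration only, not the authors' implementation: the function names `sad` and `full_search`, the indexing of candidates by the top-left corner (0 to 7 instead of the displacement -4 to 3), and the test data are assumptions made for the example.

```python
def sad(ref, win, dy, dx):
    """Total absolute difference between the 8x8 reference block and the
    8x8 candidate whose top-left pixel is at row dy, column dx of the window."""
    return sum(abs(ref[i][j] - win[dy + i][dx + j])
               for i in range(8) for j in range(8))


def full_search(ref, win):
    """Test all 64 candidates of a 15x15 window.

    Returns (dy, dx, distortion, operations) for the best candidate, where
    operations counts one add/sub per partial distortion."""
    best = None
    operations = 0
    for dy in range(8):
        for dx in range(8):
            distortion = sad(ref, win, dy, dx)
            operations += 64
            if best is None or distortion < best[2]:
                best = (dy, dx, distortion)
    return best + (operations,)


# Hypothetical test data: a window with distinct pixel values and a
# reference block copied from row 2, column 5 of that window.
window = [[15 * y + x for x in range(15)] for y in range(15)]
reference = [row[5:13] for row in window[2:10]]
print(full_search(reference, window))
```

With this test data the script prints `(2, 5, 0, 4096)`: the copied block is found with zero distortion, and the operation count matches the 4096 add/sub operations per motion vector given in Section 2.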
The candidate with the lowest total distortion is taken as the motion vector. The memory address of the first pixel of this block is the output of the hardware. With this information a decoding algorithm knows where to find the block.

The criterion used in this hardware to determine the block that best fits the reference is derived from the MAE (mean absolute error). Because the distortions are computed from just the absolute difference between two pixels, the average value is not calculated; instead the values are summed to compose the total absolute error. In practice the block with the lowest average distortion also has the lowest total distortion, so skipping the average calculation saves computational effort.

[Figure 1: Architecture for motion estimation (Reference Block Buffer with port R; Search Window Buffer with ports S1 and S2; registers R0, R1, ..., R7; processors with inputs Dref and Din; Comparing unit)]

3.1. Memory

Since the search window is roughly twice as wide as the reference block, performance can be improved by using a dual-port memory for the search window. The search window is divided into a left and a right side. Port S1 receives data from the left side while port S2 receives data from the right. The data from both ports are distributed to all processors. Each processor has a multiplexer at its input to choose the port the data come from.

The reference memory holds an 8x8 block and has a single port. The data read from port R pass, one register per clock cycle, through the registers R0, R1, ..., R7. This lets each processor use a different reference pixel at each clock cycle even though the reference memory is single-port.

The memory definition imposes some limitations on the architecture. The next subsection shows how the operations were sequenced so that each one executes at the moment its operands are available.

3.2. Data Flow

Table 2 details the data flow of the architecture proposed in Figure 1. The pixels from the Reference and Search memories are processed sequentially from left to right. Eight blocks are processed in parallel. For a given vertical displacement i = 0, 1, ..., 7, P0 computes D(i,0), P1 computes D(i,1), ..., and P7 computes D(i,7), so each processor computes a window displaced one pixel to the right relative to the previous one. The reference-block pixels are passed along the processors through serial registers: a pixel available to P0 at time t is available to P1 at time t+1. It therefore takes 8 cycles for the first data to reach P7, and only after the 8th clock cycle does the architecture work fully in parallel.

At time t=0, P0 computes $acc_0 = r_{0,0} - s_{0,0}$ (here and in Table 2 the absolute value taken by the processor is left implicit). At this moment no data are available to the other processors, which stand by.
At time t=1, P0 computes $acc_0 = acc_0 + r_{0,1} - s_{0,1}$. The pixel $r_{0,0}$ is now available to P1, which computes $acc_1 = r_{0,0} - s_{0,1}$. Note that P0 and P1 use the same data from the Search memory, so only the reference data go through the registers R0, R1, ..., R7.

Table 2 - Data Flow (columns P2 to P6 omitted; "--" marks an idle port or processor)

Cycle  R       S1      S2       P0               P1               ...  P7
0      R(0,0)  S(0,0)  --       R(0,0) - S(0,0)  --                    --
1      R(0,1)  S(0,1)  --       R(0,1) - S(0,1)  R(0,0) - S(0,1)       --
2      R(0,2)  S(0,2)  --       R(0,2) - S(0,2)  R(0,1) - S(0,2)       --
...
7      R(0,7)  S(0,7)  --       R(0,7) - S(0,7)  R(0,6) - S(0,7)       R(0,0) - S(0,7)
8      R(1,0)  S(1,0)  S(0,8)   R(1,0) - S(1,0)  R(0,7) - S(0,8)       R(0,1) - S(0,8)
9      R(1,1)  S(1,1)  S(0,9)   R(1,1) - S(1,1)  R(1,0) - S(1,1)       R(0,2) - S(0,9)
...
14     R(1,6)  S(1,6)  S(0,14)  R(1,6) - S(1,6)  R(1,5) - S(1,6)       R(0,7) - S(0,14)
15     R(1,7)  S(1,7)  S(0,14)  R(1,7) - S(1,7)  R(1,6) - S(1,7)       R(1,0) - S(1,7)
...
56     R(7,0)  S(7,0)  S(6,8)   R(7,0) - S(7,0)  R(6,7) - S(6,8)       R(6,1) - S(6,8)
57     R(7,1)  S(7,1)  S(6,9)   R(7,1) - S(7,1)  R(7,0) - S(7,1)       R(6,2) - S(6,9)
...
62     R(7,6)  S(7,6)  S(6,14)  R(7,6) - S(7,6)  R(7,5) - S(7,6)       R(6,7) - S(6,14)
63     R(7,7)  S(7,7)  S(6,14)  R(7,7) - S(7,7)  R(7,6) - S(7,7)       R(7,0) - S(7,7)
64     --      --      S(7,8)   --               R(7,7) - S(7,8)       R(7,1) - S(7,8)
65     --      --      S(7,9)   --               --                    R(7,2) - S(7,9)
...
70     --      --      S(7,14)  --               --                    R(7,7) - S(7,14)
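The schedule in Table 2 follows a simple rule: processor Pj sees each reference pixel j cycles after P0 and pairs it with a search pixel j columns further right. The sketch below reproduces that rule in software for the first set of eight candidates. It is a behavioural illustration under stated assumptions, not the VHDL description: the names `schedule` and `parallel_distortions` and the test data are hypothetical, and the port multiplexing is not modelled.

```python
def schedule(t, processors=8, size=8):
    """Operands of each processor at cycle t for the first set of candidates.

    Entry j is None while Pj is idle, otherwise a pair
    ((ref_row, ref_col), (search_row, search_col))."""
    operands = []
    for j in range(processors):
        k = t - j                       # Pj sees each reference pixel j cycles after P0
        if 0 <= k < size * size:
            row, col = divmod(k, size)
            operands.append(((row, col), (row, col + j)))  # candidate j is shifted j pixels
        else:
            operands.append(None)       # pipeline still filling, or already drained
    return operands


def parallel_distortions(ref, win, cycles=71):
    """Accumulate the eight distortions D(0,0)..D(0,7) cycle by cycle."""
    acc = [0] * 8
    for t in range(cycles):
        for j, operand in enumerate(schedule(t)):
            if operand is not None:
                (ri, rj), (si, sj) = operand
                acc[j] += abs(ref[ri][rj] - win[si][sj])
    return acc


# Hypothetical test data: the reference is the top-left 8x8 block of the window.
window = [[(7 * y + 3 * x * x) % 256 for x in range(15)] for y in range(15)]
reference = [row[:8] for row in window[:8]]
print(schedule(8)[7])
print(parallel_distortions(reference, window))
```

The first line printed is `((0, 1), (0, 8))`, the P7 entry of cycle 8 in Table 2. Running the loop for cycles 0 to 70 gives the same eight totals as computing each candidate separately, which is the property the pipelined arrangement relies on.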

At time t=7, $r_{0,0}$ reaches P7, the last processor, which computes $acc_7 = r_{0,0} - s_{0,7}$. From this moment on the processors are fully used.

At time t=8 the three memory ports are active. Port R receives the first pixel of the second line of the reference block, S1 receives the first pixel of the second line of the left side of the search memory, and S2 receives the first pixel of the first line of the right half of the search memory. At this moment P0 computes $acc_0 = acc_0 + r_{1,0} - s_{1,0}$, starting the second line, while P1 computes $acc_1 = acc_1 + r_{0,7} - s_{0,8}$; that is, each processor from P1 to P7 computes in this cycle $acc_j = acc_j + r_{0,8-j} - s_{0,8}$, where j = 1, 2, ..., 7. In this cycle the input multiplexer of the processing unit switches only for P7, so that data from S2 become the operand of P7; in the following cycle the P6 multiplexer switches, and so on. The control logic must generate the multiplexer control signals for each processor.

At time t=63, P0 computes the last pixel of the reference block. Processing continues until t=70 (the last row of Table 2), when the last data reach P7, which finishes its computation with $acc_7 = acc_7 + r_{7,7} - s_{7,14}$. At cycle t=64, P1 ends its computation and the comparing unit, which appears at the bottom of Figure 1, can start comparing D(0,0) with D(0,1). The comparing unit needs only one subtractor, because the distortion values become available cycle after cycle and can be processed serially. The control machine must analyze the results of the comparing unit and decide which candidate becomes the motion vector.

The processing of the next set of distortions can start at t=64, since at t=63 processor P0 finishes the calculation of D(0,0).
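Because the distortions arrive one per cycle, the comparison reduces to a running minimum with a single subtraction per cycle. The sketch below models that behaviour in software. The class name `ComparingUnit`, the reset value and the sample distortion values are assumptions made for illustration; the source only states that one subtractor is used and that the control machine records the motion vector.

```python
class ComparingUnit:
    """Serial minimum search over 14-bit distortion values."""

    WIDTH = 14

    def __init__(self):
        self.reset()

    def reset(self):
        # Assumption: reset loads the largest 14-bit value so that the first
        # distortion presented always replaces it.
        self.lower = (1 << self.WIDTH) - 1
        self.index = None

    def step(self, acc, index):
        """Present one distortion; return True when Lower was replaced."""
        difference = acc - self.lower    # the single subtraction per cycle
        if difference < 0:               # sign bit set: the new value is lower
            self.lower = acc
            self.index = index           # where the control machine would update MV
            return True
        return False


unit = ComparingUnit()
updates = [unit.step(d, i) for i, d in enumerate([812, 640, 990, 640, 15, 4000])]
print(unit.lower, unit.index, updates)
```

The script prints `15 4 [True, True, False, False, True, False]`. With a strict comparison, a later candidate with an equal distortion does not replace the stored one; whether the hardware resolves ties this way is not stated in the source.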
The computation of the distortions D(1,0), D(1,1), ..., D(1,7) follows the same sequence of operations described above, but the time to fill the registers Ri is pipelined with the comparison of the distortions D(0,0), D(0,1), ..., D(0,7).

The order of operations is represented in Figure 2 through gray tones. Three cycles of 64 operations are represented; the first cycle starts at line 0, the second at line 1, and so on. These cycles happen serially in time and start at the times shown on the left side of the figure. The darker windows in Figure 2 start their computation after the brighter ones, so that the computation of the distortion for each 8x8 window, from left to right, is one clock cycle later than that of the window to its left.

[Figure 2: Gray-tone representation of the operation sequence (time labels T=0, T=64, ..., T=448, T=512+8)]

4. VHDL Description

The VHDL description is at the RTL level. The basic components are the Memory blocks, the Comparing unit and the Processing units. The circuit was synthesized for the Altera device EPF10K20RC208-4 of the FLEX 10K family. We chose an FPGA implementation just for the sake of prototyping. A standard-cell implementation would of course give even better results. Interestingly, the performance achieved using FPGAs is enough for the target application.

4.1. Memory

The memories were implemented using the Altera megafunction lpm_ram_dq. The target device has 6 memory blocks of 256 bytes each. The data used for motion estimation are the luminance pixels, which are 8 bits (1 byte) wide, so this memory has the ideal width. All 6 memory blocks were used, but none was fully used and there are unused locations in every block. The 4 memories that implement the halves of the search window were half used, while the two memories used for the reference block were one-quarter used. The circuit implementation has twice the necessary memory.
Duplicating the memory lets the circuit read the blocks S11, S12 and R1 while the blocks S21, S22 and R2 are being filled by an external circuit, and vice versa. The memory was organized in the following blocks:

R1: Reference Memory 1;
R2: Reference Memory 2;
S11: Search Memory 1, Right Half;
S12: Search Memory 2, Right Half;
S21: Search Memory 1, Left Half;
S22: Search Memory 2, Left Half.

The control machine generates the memory addresses. To do so it uses a counter incremented at each clock cycle. The Reference memories have 6-bit addresses while the Search memories have 7-bit addresses. The counter was therefore implemented with 7 bits, but the reference blocks receive only the 6 least significant bits; for example, when Sadd=63 then Radd=63, and when Sadd=64 then Radd=0. An adder generates the address for the right side of the search memory by adding the constant 8 to the left-side address.

The write enable (WE) signal is the same for all the memory blocks, but the right-side memories receive this signal inverted. This signal changes state every 64 cycles. While the blocks 1 receive WE=1, the blocks 2 receive WE=0 because of the inverter, and the situation is reversed every 64 cycles.

4.2. Datapath Block

The datapath block of the motion estimator corresponds to the VHDL description of the architecture in Figure 1. The main blocks are the Processing units and the Comparing unit. The inner blocks of the operative block were built from components offered by the synthesis tool (Altera), such as memories and adders. The memories are asynchronous, so for a write the WE signal must be asserted only while the address and data are valid; for a read there is a combinational delay after the address becomes valid.

The architecture as a whole needs few I/O pins, which was an initial requirement. The I/O functions are just the data input, the motion vector output and a few control signals.

The processing unit must be optimized because it is the basic unit and it is what will be replicated to implement more powerful circuits. The implemented architecture performs the operation acc = acc + R(i,j) - S(i,j) using a single block of combinational logic.
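The address generation performed by the control machine (a 7-bit counter, the 6 least significant bits for the reference memories, a constant 8 added for the right side, and a write enable that toggles every 64 cycles) can be sketched as a small function. This is an illustration only: the function name `memory_control`, the wrap of the adder result to 7 bits and the initial polarity of WE are assumptions, since the source does not specify them.

```python
def memory_control(cycle):
    """Address and write-enable values for one clock cycle.

    Returns a dict with the left search address (Sadd), the right search
    address, the reference address (Radd) and both polarities of WE."""
    sadd = cycle & 0x7F                 # 7-bit counter
    radd = sadd & 0x3F                  # reference memories get the 6 LSBs
    sadd_right = (sadd + 8) & 0x7F      # assumption: the adder result wraps to 7 bits
    we = (cycle // 64) & 1              # assumption: WE starts at 0, toggles every 64 cycles
    return {"Sadd": sadd, "Sadd_right": sadd_right, "Radd": radd,
            "WE": we, "WE_inverted": we ^ 1}


for cycle in (63, 64):
    print(cycle, memory_control(cycle))
```

For cycle 63 the function gives Sadd=63 and Radd=63, and for cycle 64 it gives Sadd=64 and Radd=0, matching the example in the text.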
It was shown that the delay of this logic is not in the critical path of the circuit. Table 3 shows that this delay is of the same order as the delay of the Comparing unit. If reducing the processing unit delay directly reduced the maximum delay of the circuit, it would be worthwhile to insert a register between the two add/sub stages, that is, to pipeline the unit.

In this project many adders were used to implement subtractions of two binary operands represented in two's complement. For such a subtraction it is only necessary to invert one operand and set the carry-in to 1.

The architecture of the Processing unit is represented in Figure 3. The unit has one 8-bit adder, one 14-bit adder, one 14-bit register, logic gates and multiplexers.

[Figure 3: Processing Unit (labels: S1, S, R, 8-bit difference dif, sign bit dif(8), 14-bit adder, acc, Ri, Ri+1)]

The first adder computes the difference R(i,j) - S(i,j). If this difference is negative it is subtracted from the accumulated value, and if it is positive it is added, so that the absolute value of the difference is always accumulated, since -(-Dif) = +Dif. The sign bit of the first add/sub result controls the XOR gates (a controlled inverter); this signal is also the carry-in of the second adder. The accumulator can be initialized; the reset must occur at the beginning of each distortion computation. The second adder and the accumulator were implemented with 14 bits to support the largest possible difference, Dif = 255, accumulated over 64 pixels (64 x 255 = 16320, which is below 2^14 = 16384).
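The two-adder scheme can be checked with a bit-level software model. The sketch below is an assumption-laden illustration, not the VHDL source: the function name `processing_unit_step` is hypothetical, and the sign is derived from the missing carry out of the 8-bit adder, which is one common way to obtain the sign bit the text refers to.

```python
def processing_unit_step(acc, r, s):
    """One clock cycle of the processing unit: returns acc + |r - s| (14 bits)."""
    total = r + (~s & 0xFF) + 1          # first adder: R + not(S) + carry-in 1
    sign = 0 if total & 0x100 else 1     # no carry out of bit 7 means R < S
    dif = total & 0xFF
    if sign:
        dif |= 0x3F00                    # sign-extend the negative difference to 14 bits
    operand = dif ^ (0x3FFF if sign else 0)   # XOR gates act as a controlled inverter
    return (acc + operand + sign) & 0x3FFF    # second adder, carry-in = sign


acc = 0
for _ in range(64):                      # worst case: 64 differences of 255
    acc = processing_unit_step(acc, 0, 255)
print(acc)
```

The script prints `16320`, the worst-case total, which still fits in 14 bits. Inverting the difference and adding the sign as carry-in is the same two's-complement negation described for the subtractors, so no separate absolute-value block is needed.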

The Comparing unit was implemented as shown in Figure 4. It has a 14-bit adder, a 14-bit register, inverters and multiplexers. Its function is, at each clock cycle, to compare $acc_{i+1}$ with $acc_i$ and keep the lower value. Every time this unit changes the value of Lower (when the new value is lower than the stored one), the control machine must update the MV register, which holds the motion vector, with the address of the 8x8 block being computed.

[Figure 4: Comparing Unit (labels: acc, Low, 14-bit subtraction with sign bit S(14), reset)]

4.3. Control Machine

Some control signals occur periodically every 8 clock cycles; others must be 1 only during the 8 initial and/or final cycles. To handle this, a state machine with 8 states was implemented together with an auxiliary counter that signals macrostates. This counter has 4 bits and counts from 0 to 15. It is incremented by a clock at 1/8 of the circuit clock frequency, so that its value represents a superstate of the control machine. Some signals are controlled with period 8 by the state machine and others with period 8*16=128 by the counter.

5. Results and Conclusions

The synthesis results are shown in Table 3. The memories use no logic cells, since their addressing is done by the control unit. The operative part represents 90.8% of the total area.

Table 3 - Area and frequency per block

Block       LCs    Mem Bits  Fmax (MHz)
Memory R    [...]  [...]     [...]
Memory S    [...]  [...]     [...]
UP          55     -         27.02
Comparing   41     -         26.73
Operative   [...]  [...]     [...]
Control     98     -         69.44
Total       [...]  [...]     [...].32

The circuit as a whole uses 62% of the device's logic cells and 41% of the available memory. This shows that the device cannot hold two of these circuits. The clock frequencies obtained for the Processing unit and Comparing unit blocks are within what was expected for this technology and device. The frequency of the design as a whole is below expectations, which is attributed to the routing difficulties the device presents as area usage increases.
The performance reached is enough for real-time motion estimation at usual video definitions, but higher definitions would require better performance. A standard-cell layout or a larger and faster device would be fast enough for the real-time task.

6. References

[BHA99] V. Bhaskaran and K. Konstantinides, Image and video compression standards - algorithms and architectures, Kluwer Academic Publishers.
[PEN93] W. Pennebaker and J. Mitchell, JPEG still image data compression standard, Van Nostrand Reinhold.
[PIT00] I. Pitas, Digital Image Processing Algorithms and Applications, Wiley-Interscience Press.
[CHA97] W. Pennebaker, J. Mitchell, Chad E. Fogg and Didier J. LeGall, MPEG video compression standard, Chapman & Hall.
[AGO99] L. V. Agostini, Estudo de padrões de compressão de imagens para aplicações VLSI, Trabalho individual, Programa de pós-graduação em computação, Universidade Federal do Rio Grande do Sul, Jan.
[TRA00] Trac D. Tran, The BinDCT: fast multiplierless approximation of the DCT, IEEE Signal Processing Letters, vol. 7, no. 6, Jun.
[BJO94] G. Bjoentegaard, MPEGII coding schemes for low delay requirements, ISCAS 94 Tutorials, May.
[MOR94] Morris John, MPEGII The main profile, ISCAS 94 Tutorials, May 1994.
