RESEARCH AND DEVELOPMENT LOW-COST BOARD FOR EXPERIMENTAL VERIFICATION OF VIDEO PROCESSING ALGORITHMS USING FPGA IMPLEMENTATION


RESEARCH AND DEVELOPMENT LOW-COST BOARD FOR EXPERIMENTAL VERIFICATION OF VIDEO PROCESSING ALGORITHMS USING FPGA IMPLEMENTATION

Filipe DIAS, Igor OLIVEIRA, Flávia FREITAS, Francisco GARCIA and Paulo CUNHA
Graduate Program on Electrical Engineering and Department of Electronic Engineering and Telecommunication
Pontifical Catholic University of Minas Gerais, Belo Horizonte, MG, Brazil
siselet@pucminas.br

ABSTRACT

This paper proposes the architecture of a low-cost board for research and development projects on video processing in hardware. The educational market currently lacks platforms for such designs, and the majority of published scientific works present software simulations only. The proposed platform enables the verification of video operators under real experimental constraints, which are essential for determining performance criteria in real-time processing. It is composed of three modules: (i) the Video Input Interface (VII), which receives the ITU-R BT.601 interlaced digital video stream and delivers progressive digital video represented by the three parallel components Y, Cb and Cr; (ii) the Video Processing Module (VPM), which implements a generic video processor according to the user's needs; and (iii) the Video Output Interface (VOI), which converts the video components from YCbCr to RGB. The architecture is both technology and manufacturer independent, and has been implemented and verified in the Altera EP2C70F896C6 FPGA of Terasic's DE2-70 board. The main constitutive blocks have been developed in VHDL, while the simpler, more generic ones have been designed using Quartus II megafunctions. Together, the VII and VOI modules occupy less than one percent of the FPGA's total capacity, leaving a large area available for the Video Processing Module.

Keywords: research and development on video processing, ITU-R BT.601, VGA, real-time hardware processing, real experimental constraints, FPGA, VHDL.

I. INTRODUCTION

Nowadays, scientific works on image and video processing are usually restricted to the verification of algorithms through software simulation, in environments such as MATLAB. This is due to the lack of low-cost platforms in the educational market that enable the verification of real-time video processing algorithms through experimental hardware implementations. However, verification through simulation alone does not reveal problems that frequently arise in experimental tests, such as the limited number of bits available to represent the analog signals, the processing time and circuit latency, the memory required and the power consumption, among others. Some recent works developed at the borderline between video processing and hardware architectures have shown this concern [1]-[3]. Moreover, image and video processing applications usually present a high level of data parallelism, a feature that makes hardware realizations particularly appropriate, in contrast to software realizations, whose processing is normally sequential.

The authors thank FAPEMIG (Foundation for Research Support of the State of Minas Gerais) for providing the necessary equipment and for supporting the Scientific Initiation Scholarship under the PROBIC-FAPEMIG agreement, and also the Pontifical Catholic University of Minas Gerais.

This paper proposes a low-cost platform for scientific research and technological innovation projects on real-time still image or video processing. The platform receives a digital video signal in the ITU-R BT.601 standard [4] (the default output of analog-to-digital standard definition TV converters), enabling its color components (YCbCr or RGB) to be processed by generic operators (such as encoding and decoding algorithms, image analysis, image restoration, image convolution and interpolation, among others).
The platform also enables the visualization of either the input video or the processed video on a VGA (Video Graphics Array) monitor. The board consists of two digital interfaces, named Video Input Interface (VII) and Video Output Interface (VOI), and a Video Processing Module (VPM). The VII adapts the data coming from the TV decoder (in ITU-R BT.601 format) into a more convenient representation (parallel channels for Y, Cb and Cr), so that the video components can be processed independently and simultaneously. The processing algorithms depend on the user's needs and are implemented in the VPM. The VOI is responsible for transforming the video signal represented by its YCbCr components (taken from the output of either the VII or the VPM) into the VGA format, for visualization on a computer monitor. Alternatively, the outputs of the VII can be connected directly to the inputs of the VOI in order to convert the color representation from YCbCr to RGB, on which the VPM can then operate. That is, the VPM can process either YCbCr or RGB.

The proposed hardware architecture is both technology and manufacturer independent, allowing various device options for implementation. It has been verified through implementation on Terasic's DE2-70 board, which already provides analog video inputs, the ADV7180 TV decoder and the Altera EP2C70F896C6 FPGA (Field Programmable Gate Array), in which both interfaces and a simple demonstration VPM have been implemented. The design has been developed with Altera's Quartus II tool suite. This realization makes the low-cost DE2-70 development board suitable for research and innovation in real-time video processing and analysis, such as computer vision, among other areas.
The paper is organized as follows: Section II presents the proposed board's architecture and a general description of the proposed input and output interfaces. Section III describes the implementation of both video interfaces on the DE2-70 board. Section IV shows a simple VPM that implements a one-dimensional convolution algorithm. Finally, the main conclusions are presented in Section V.

Fig. 1. The proposed hardware architecture

II. THE PROPOSED BOARD'S ARCHITECTURE

Figure 1 presents the hardware architecture proposed for the real-time video processing board. Video operators usually manipulate the image color components separately, and therefore need these components delivered on distinct parallel datapaths. The ITU-R BT.601 standard, however, provides the three color components multiplexed on the same datapath, as shown in Figure 2 [5]. Moreover, ITU-R BT.601 defines an interlaced video stream with 30 frames per second (each frame with 720x480 pixels). There are thus two image fields captured at different times: the field constituted by the odd video lines and the field composed of the even video lines.

Timing information in the ITU-R BT.601 standard is inserted into the video stream through special data sequences [5]. These sequences indicate the Start of Active Video (SAV) and the End of Active Video (EAV) in each video line. ITU-R BT.601 uses the H (horizontal blanking), V (vertical blanking) and F (field) control signals to convey timing information: F = 0 for field 1 and F = 1 for field 2; V = 1 during vertical blanking; H becomes 0 at SAV and 1 at EAV (that is, H = 1 during horizontal blanking). Both SAV and EAV consist of a preamble and a status word. The SAV preamble is identical to the EAV preamble and is formed by the data word sequence FFH, 00H, 00H. The data word that follows these three words is the status word, usually identified as XYZH. It is the status word that differentiates a SAV sequence from an EAV sequence, through the H bit.
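The preamble and status-word mechanics described above can be sketched as a small behavioral model. Python is used here purely as illustration of what the BT.601 Decoder block computes; the exact bit layout of XYZ (F, V, H in bits 6..4, four protection bits below) follows the companion ITU-R BT.656 convention and is stated here as an assumption, since the paper does not spell it out.

```python
def make_status_word(f, v, h):
    """Build the XYZ status word: bit 7 always 1, then F, V, H in
    bits 6..4 and protection bits P3..P0 derived from F, V, H."""
    p3 = v ^ h
    p2 = f ^ h
    p1 = f ^ v
    p0 = f ^ v ^ h
    return (0x80 | (f << 6) | (v << 5) | (h << 4)
            | (p3 << 3) | (p2 << 2) | (p1 << 1) | p0)

def decode_status_word(xyz):
    """Extract (F, V, H) from a status word."""
    return (xyz >> 6) & 1, (xyz >> 5) & 1, (xyz >> 4) & 1

def find_timing_refs(stream):
    """Scan a word stream for FF 00 00 XYZ preambles and report
    ('SAV' or 'EAV', F, V) for each, as the BT.601 Decoder block does."""
    refs = []
    for i in range(len(stream) - 3):
        if stream[i] == 0xFF and stream[i + 1] == 0x00 and stream[i + 2] == 0x00:
            f, v, h = decode_status_word(stream[i + 3])
            refs.append(('EAV' if h else 'SAV', f, v))
    return refs
```

For field 1 during active video, this yields the familiar SAV word 80H and EAV word 9DH.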
The status word also carries the timing signals V (vertical blanking) and F. Naturally, V and F only change during EAV sequences. A video line starts with an EAV sequence, followed by the horizontal blanking interval (268 samples) and then a SAV sequence. The active video samples are only transmitted after the SAV sequence. The Y:Cb:Cr data (the luminance and the two chrominance components) is provided in 4:2:2 format: the Y component is present for every sample, while the chrominance components are only transmitted for the even samples, and the active video stream always begins with a Cb sample. In the multiplexed sequence, the co-sited samples (those corresponding to the same pixel) are placed sequentially in the order Cb, Y, Cr. No chrominance information is transmitted for odd pixels. Hence, during an active video line, 720 samples of Y are transmitted (Y0 to Y719), but only 360 samples of Cb and 360 samples of Cr (for the even pixels only). The final active sample, corresponding to Y719, is also the last sample of the video line.

Fig. 3. Interconnection among video devices and Terasic's DE2-70 board

The Video Input Interface (VII) receives a 27 MHz clock, a reset signal and the ITU-R BT.601 video stream as inputs. This video stream is converted into progressive video (instead of interlaced) with 60 frames per second, 720x480 resolution (the same as ITU-R BT.601) and 4:4:4 format (the three components Y, Cb and Cr transmitted for all pixels), with independent and parallel datapaths (instead of multiplexed samples) for the luminance Y and the chrominances Cb and Cr. In this way, the video components can be processed by a generic VPM, and the resulting video can be visualized on a VGA monitor through the VOI. The circuit implemented in the VPM is customized to the user's needs and depends on the video application.
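The 4:2:2 multiplexing order described above (Cb0, Y0, Cr0, Y1, Cb2, Y2, ...) and its separation into parallel component streams can be illustrated with a short Python sketch. The function names are hypothetical, and simple chroma repetition stands in here for the Lanczos interpolation that the actual interp block performs (Section III-C).

```python
def demux_422(line_words):
    """Split a multiplexed 4:2:2 line (Cb0 Y0 Cr0 Y1 Cb2 Y2 Cr2 Y3 ...)
    into parallel Y, Cb, Cr lists; chroma of odd pixels is absent."""
    y, cb, cr = [], [], []
    for i, w in enumerate(line_words):
        if i % 2 == 1:        # odd word index -> luminance sample
            y.append(w)
        elif i % 4 == 0:      # word index 0, 4, 8, ... -> Cb
            cb.append(w)
        else:                 # word index 2, 6, 10, ... -> Cr
            cr.append(w)
    return y, cb, cr

def to_444_nearest(y, cb, cr):
    """Upsample chroma to 4:4:4 by sample repetition (a simplified
    stand-in for the 8-tap Lanczos interpolation of the interp block)."""
    cb44 = [cb[i // 2] for i in range(len(y))]
    cr44 = [cr[i // 2] for i in range(len(y))]
    return y, cb44, cr44
```

Note that the luminance samples sit at the odd word indices, which is exactly the property the VII exploits when it uses the counter's least significant bit as a write-enable (Section III-A).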
The VPM presents a latency of N clock cycles, determined by the complexity of the implemented algorithm. The synchronism and blanking signals provided as outputs by this module must be identical to those supplied by the VII, but delayed by N clock cycles.

The second interface, the Video Output Interface (VOI), is responsible for transforming a video signal represented by its YCbCr components into one represented by its RGB components. These components can be delivered either to a VPM that processes RGB (instead of YCbCr) or to a DAC (digital-to-analog converter), together with the timing and synchronism signals, so that the video signal can be reconstructed on a VGA monitor. When the application is restricted to visualizing the input video, the VPM may be suppressed and the VOI can receive the output signals of the VII directly as inputs.

III. EXPERIMENTAL REALIZATION OF THE VIDEO INTERFACES

The proposed architecture has been implemented and verified on Terasic's DE2-70 board [6] for development and educational purposes. This board has been chosen due to its low cost (around USD 300), the multimedia interfaces supplied (such as the TV decoder and the VGA output), its storage devices, and the large capacity of the Altera Cyclone II EP2C70F896C6 FPGA, equipped with almost 70,000 logic elements, which allows complex algorithms to be implemented. Figure 3 shows the communication mechanism among the interfaces proposed in this paper and the integrated circuits used on the DE2-70 (the ADC, analog-to-digital converter, and the DAC, digital-to-analog converter). The DE2-70 board supplies two TV decoders (ADV7180 [7]) that convert an analog TV signal into an ITU-R BT.601 video stream. The board also provides a DAC (ADV7123 [8]) that connects the FPGA outputs to the VGA port.
The proposed architecture has been implemented using a modular and hierarchical approach, aiming at easy and practical development in both the design and test steps. The bottom level of the hierarchy, which implements the constitutive blocks of the video interfaces, has been designed using either Quartus II megafunctions or VHDL (VHSIC Hardware Description Language, where VHSIC stands for Very High Speed Integrated Circuit) [9].

Fig. 2. Horizontal timing of a video line scanning in the ITU-R BT.601 standard (4:2:2 Y:Cb:Cr; 720 active samples per line; 27 MHz clock)

This section first presents the top of the hierarchy (a schematic file) for the video input and output interfaces in the case of monochromatic video, with detailed explanations of each block's behavior. The implementation for monochromatic video is a little simpler than the one for color video, since only the luminance is used and the chrominance components are discarded. The section then presents the modifications necessary to adapt the input and output interfaces to color video.

III-A. Video Input Interface (VII) for Monochromatic Video

Figure 4 shows the top of the hierarchy for the Video Input Interface (VII), in which the blocks BT.601 Decoder, Sync H, Sync V, Sample Counter and Demux lb enable have been described in VHDL. The simpler and more generic blocks, line buffer and mux line buffer out, have been built from Quartus II megafunctions.

Since BT.601 provides an interlaced video stream but VGA is a progressive video standard, a deinterlace operation must be carried out in the VII. In progressive video, all the lines of a frame are sent sequentially. In interlaced video, by contrast, the odd lines, captured at time t1, compose field 1 and are sent first; the even lines, captured at time t2, constitute field 2 and are sent afterwards. The deinterlace operation consists of using the fields of the interlaced video to estimate the missing lines of the frame at each time ti. In the VII, the deinterlace method used is line repetition. Despite being the most rudimentary method available, it can be implemented without memory external to the FPGA, since a few lines can be stored in the FPGA's embedded memory. Each line buffer block implements a memory capable of storing one video line.
The BT.601 bitstream is the input of the BT.601 decoder block, which is in charge of identifying the preamble sequence and translating the status words into the signals SAV, EAV and Field, all described in Section II. The bitstream is also sent to the line buffer blocks for storage. The sample counter block, through its sample output, implements a counter that provides the writing addresses for the line buffer blocks. Since only the luminance Y is kept (the chrominance components are discarded), the least significant bit of the sample output is used as an activation signal for the line buffer blocks: this bit is always 1 for Y samples (and 0 for all chrominance samples), so only the Y samples are stored in the buffers.

Since the video transmission is intended to occur in real time, the reading and writing of video lines must happen simultaneously. For this reason, two line buffer blocks have been implemented. While one of them stores the Y samples of the incoming video line, the other is read in order to transmit the Y samples of a previously stored line. At the start of each new video line, the roles of the two line buffer blocks are switched: the input stream's Y samples are written into the buffer that was being read, and the buffer that was being written to starts to be read. The least significant bit of the sample output of the sample counter block is connected to the input of the demux lb enable block, which routes this signal to one of its outputs, out0 or out1. These outputs are connected to the wren (write enable) inputs of the two line buffer blocks, controlling which buffer is written and which is read.
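The double-buffering scheme just described can be modeled as follows. This is a behavioral Python sketch of the ping-pong operation, not the VHDL itself; the class and method names are invented for illustration.

```python
def deinterlace_by_line_repetition(field_lines):
    """Line repetition: each field line is emitted twice, so a
    240-line field becomes a 480-line progressive frame."""
    frame = []
    for line in field_lines:
        frame.append(list(line))
        frame.append(list(line))
    return frame

class PingPongLineBuffers:
    """Behavioral model of the two line buffer blocks together with
    the demux lb enable / mux line buffer out pair."""
    def __init__(self):
        self.buf = [[], []]
        self.wr_sel = 0

    def new_line(self, samples):
        """Store one incoming line in the write buffer and return the
        other buffer's contents twice (the double read driven by sync H).
        Buffer roles switch at the next SAV, i.e. at the next call."""
        rd_sel = 1 - self.wr_sel
        self.buf[self.wr_sel] = list(samples)
        out = [list(self.buf[rd_sel]), list(self.buf[rd_sel])]
        self.wr_sel = rd_sel
        return out
```

The first call returns empty lines, mirroring the hardware behavior seen in the timing simulation of Section III-D, where the RGB outputs stay at zero during the first video line.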
The demux lb enable block also produces the rd sel signal, which toggles at every rising edge of the SAV signal (provided by the BT.601 decoder block). The rd sel signal is the selection bit of the multiplexer implemented in the mux line buffer out block, which is in charge of correctly routing the outputs of the line buffer blocks to the result output. To implement the line repetition required by the deinterlace process, each line buffer block must be read twice while the following line is stored in the other line buffer block. The sync H block, responsible for the VGA video's horizontal synchronism, also takes charge of this double reading: it implements a counter that supplies the reading address for the line buffer blocks. Thus, during the interval in which the sample counter block counts from 0 to 1715 (1716 samples in total: 4 samples of the EAV sequence, followed by 268 blanking samples, 4 samples of the SAV sequence and, finally, 1440 active video samples), the sync H block counts only from 0 to 857. In other words, the sync H block produces all the reading addresses for one specific line twice consecutively, while the sample counter block generates the writing address for each sample only once. Note that the total number of addresses matches, since the least significant bit of the sample output is not used to compose the address. The counters implemented in the sample counter and sync H blocks are cleared by the SAV signal, which enforces synchronism with the BT.601 video bitstream. The sync V block implements a counter from 0 to 524 and operates as a line counter. The sync H block has an output named ena cont V out, which remains high during the last sample of each line.
This signal is connected to the sync V block and makes its counter increment at the following rising edge of the clock. The sync H and sync V blocks are also responsible for generating the H SYNC, H BLANK, V SYNC and V BLANK signals, whose production is based on the count values in each block.

Fig. 4. Implementation of the Video Input Interface (VII)

In the sync H block, the H SYNC signal is low between samples 745 and 847 and high for the other samples. The H BLANK signal stays low between samples 720 and 857 (the non-active samples) and high for the other samples. In the sync V block, the V SYNC signal is low between lines 11 and 13 and high for the other lines. The V BLANK signal remains low between lines 10 and 44 (the non-active lines) and high for the other lines. The H BLANK and V BLANK signals are later combined to produce a signal named Blanking. The Field signal from the BT.601 Decoder block is connected to the sync V block, and a falling edge of the Field signal sets the counter of the sync V block to the value four, synchronizing it with the BT.601 bitstream (the high-to-low transition of the Field signal occurs at the beginning of line 4).

III-B. Video Output Interface (VOI) for Monochromatic Video

Figure 5 shows the VOI, which is composed of a single block, developed in VHDL, responsible for the component conversion and for the contrast adjustment. Contrast here refers to the range of possible values of the video components. In the ITU-R BT.601 standard, this range is limited to 16-235; VGA monitors do not have this limitation and can reproduce video signals over a larger range. The contrast block converts the Y values in the 16-235 range, coming from either the VII or the VPM, to the 0-255 range through expression (1), and each of the RGB components, in the case of monochromatic video, is then made equal to Y':

Y' = (Y - 16) x 1.16363636...,   (1)

in which Y' is the adjusted luminance value, Y is the luminance value before the contrast adjustment, and 1.16363636... = 255/219.

Figure 6 presents the implementation of the proposed platform, operating with an NTSC TV signal originating from a video camera. In this implementation, the VPM has been suppressed and the platform application has been restricted to the visualization of the interlaced input video on a VGA monitor.
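Expression (1) can be checked numerically with a one-line model. The rounding and the clamping of out-of-range codes are assumptions here; the paper does not state how the hardware treats them.

```python
def contrast_stretch(y):
    """Map the ITU-R BT.601 luminance range 16..235 to the full VGA
    range 0..255: Y' = (Y - 16) * 255/219 (255/219 = 1.1636...)."""
    y_adj = round((y - 16) * 255 / 219)
    return max(0, min(255, y_adj))   # clamp out-of-range input codes
```

As expected, code 16 maps to 0 and code 235 maps to 255, using the full dynamic range of the monitor.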
Fig. 5. Symbol for the Video Output Interface (VOI)

III-C. Modifications of the Input and Output Video Interfaces to Reconstruct Color Video

In order to provide color VGA, the Cb and Cr components must also be processed. As can be seen in Figure 2, Cb and Cr are subsampled, so an interpolation operation must be carried out to estimate the missing chrominance samples (those corresponding to the odd pixels). The interp block, shown in Figure 7, implements an interpolation with a Lanczos filter that uses eight samples (four before and four after the sample to be estimated) to determine each missing chrominance sample. The BT.601 bitstream coming from the BT.601 decoder block is the d in[7..0] input of the interp block, whose main outputs are the signals Y, Cb and Cr. The interp block has a latency of eight clock cycles; therefore, the synchronism signals SAV and EAV must be equally delayed by the interp block.

Just like the Y samples, the chrominance components (provided for all video samples by the interp block) must be stored in order to enable the conversion from interlaced to progressive video. For that reason, four additional line buffer blocks must be included (two dedicated to storing the Cb samples and two to storing the Cr samples of one video line). The signals that control these blocks are the same ones already used to control the storage of the Y samples in the original line buffer blocks; the deinterlace process therefore happens in the same way for the Y, Cb and Cr samples. The VII circuit for color video (which provides the Y, Cb and Cr components) is omitted here because of its similarity to the one presented in Figure 4.

The VOI for color video consists of the YCbCr 2 RGB block, responsible for converting YCbCr to RGB. It is shown in Figure 8 and implements equations (2) to (4), in which a contrast adjustment is also embedded.
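The 8-tap Lanczos interpolation of the interp block can be sketched as follows. This is a floating-point behavioral model, not the fixed-point hardware; the normalization of the kernel weights and the clamped handling of samples near the line borders are assumptions, since the paper does not specify them.

```python
import math

def lanczos(x, a=4):
    """Lanczos kernel L(x) = sinc(x) * sinc(x/a) for |x| < a, else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)

def upsample_chroma(samples):
    """Insert an estimated sample between each pair of known chroma
    samples, using the 4 known samples on each side (8 taps)."""
    w = [lanczos(k + 0.5) for k in range(-4, 4)]   # tap distances -3.5 .. +3.5
    norm = sum(w)
    w = [wi / norm for wi in w]                    # flat input stays flat
    n = len(samples)
    out = []
    for i in range(n):
        out.append(samples[i])
        if i == n - 1:
            break
        # estimate the missing sample at position i + 0.5
        est = sum(w[j - (i - 3)] * samples[min(max(j, 0), n - 1)]
                  for j in range(i - 3, i + 5))
        out.append(max(0, min(255, round(est))))
    return out
```

Because the normalized kernel is symmetric, a constant input is reproduced exactly and a linear ramp is interpolated to its midpoints in the interior of the line.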

Fig. 8. Symbol for the block YCbCr 2 RGB

R = 1.164(Y - 16) + 1.596(Cr - 128)   (2)
G = 1.164(Y - 16) - 0.813(Cr - 128) - 0.391(Cb - 128)   (3)
B = 1.164(Y - 16) + 2.018(Cb - 128)   (4)

Table I shows some compilation results extracted from Quartus II, relative to the implementation of the two interfaces (VII and VOI) for color video.

Fig. 6. Photograph of the implementation of the Video Input and Output Interfaces

Fig. 7. Symbol for the block interp

Table I
COMPILATION RESULTS - INPUT AND OUTPUT INTERFACES FOR COLOR VIDEO

  Family                               Cyclone II
  Total logic elements                 942 / 68,416 (1%)
  Total combinational functions        776 / 68,416 (1%)
  Dedicated logic registers            530 / 68,416 (< 1%)
  Total registers                      530
  Total pins                           40 / 622 (6%)
  Total memory bits                    41,256 / 1,152,000 (4%)
  Embedded Multiplier 9-bit elements   0 / 300 (0%)
  Total PLLs                           0 / 4 (0%)

III-D. Experimental Results

Figure 9 has been generated inside Altera's Quartus II design tool and shows the timing simulation of the color video input and output interfaces. Since the time interval of one video line is many times larger than that of one Y sample, a timing simulation of two or more video lines lacks visual detail. In the BT.601 in signal, it is possible to see time intervals with many transitions, which correspond to the active video samples, and other stable intervals with the value 16 (corresponding to non-active video, or blanking). During the first video line, the R, G and B outputs stay at zero, since the video signal is being written into one of the line buffer blocks. As soon as the second video line begins to be transmitted (after the non-active video period), the R, G and B outputs start to receive the values of the previous line. Comparing the R, G and B signals with the BT.601 in input stream, one can notice that the first line is read twice during the time interval in which the second line is transmitted.
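Equations (2) to (4) can be exercised with a small Python model. The rounding and the clamping of the results to 0-255 are assumptions about the hardware's output stage.

```python
def ycbcr_to_rgb(y, cb, cr):
    """Equations (2)-(4): full-range RGB from BT.601 YCbCr; the
    contrast adjustment is embedded in the 1.164 (= 255/219) term."""
    r = 1.164 * (y - 16) + 1.596 * (cr - 128)
    g = 1.164 * (y - 16) - 0.813 * (cr - 128) - 0.391 * (cb - 128)
    b = 1.164 * (y - 16) + 2.018 * (cb - 128)
    clamp = lambda v: max(0, min(255, round(v)))
    return clamp(r), clamp(g), clamp(b)
```

The conversion maps BT.601 black (Y = 16, Cb = Cr = 128) to RGB (0, 0, 0) and nominal white (Y = 235, Cb = Cr = 128) to (255, 255, 255).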
From Figure 9, one can see that the Blanking signal is active during the non-active video interval, as is the H sync signal, which also occurs during the non-active video period.

IV. IMPLEMENTATION OF A ONE-DIMENSIONAL CONVOLUTION MODULE IN THE VPM

The purpose of this section is to implement a simple Video Processing Module that demonstrates an application which can be verified on the proposed research and development board. A one-dimensional convolution module has been chosen, operating directly on the RGB components (the VOI's outputs). A new value S'i is calculated from the current sample Si (which can be a sample of the R, G or B component) according to the expression

S'i = (Coef0 Si-1 + Coef1 Si + Coef2 Si+1) / 256,   (5)

in which i is the column at which the current pixel is located and Coef0, Coef1 and Coef2 are constants in the range -256 to 255 that depend on the type of filter demanded by the user (low-pass, high-pass etc.); the module is clocked by the 27 MHz clock clk. Figure 10 presents the timing simulation of the block whose symbol file is given in Figure 11. Figure 10 shows a latency of 6 clock periods; the signals Hsync out and Vsync out correspond to the synchronism input signals delayed by this same latency. It can also be noted that the output frame rate is the same as the input frame rate, that is, a simple video processing application has been implemented in real time.

V. CONCLUSIONS

The achieved results show that the proposed architecture meets its initial objective, which is to provide a low-cost board aiming at

verification of generic video algorithms beyond the limits of software simulation. In this sense, the proposed board offers an adequate infrastructure for scientific research, technological development and innovation in this field, since it allows the experimental verification of real-time video processing under real hardware constraints. Although the board has been implemented in the Altera EP2C70F896C6 FPGA of Terasic's DE2-70 board, the proposed architecture is both technology and manufacturer independent; the only requirement is that the input signal be a video bitstream in the BT.601 standard. Table I shows that, in the presented implementation, the device occupation of the video input and output interfaces is minimal, which allows highly complex and computationally demanding video operators to be implemented in the VPM.

Fig. 9. Timing simulation of the implementation of the video interfaces

Fig. 10. Timing simulation of the VPM that implements a one-dimensional convolution algorithm

Fig. 11. Symbol for the VPM that implements a one-dimensional convolution algorithm

VI. REFERENCES

[1] C. A. Carneiro, Desenvolvimento de uma arquitetura paralela parametrizável de codificação JPEG-LS aplicada a imagens em altas taxas de aquisição. M.Sc. dissertation, Graduate Program on Electrical Engineering, PUC Minas, 2007 (in Portuguese).
[2] A. R. M. Diniz, Arquitetura de hardware reconfigurável paralela dedicada para a implementação da SA-DCT. M.Sc. dissertation, Graduate Program on Electrical Engineering, PUC Minas, 2007 (in Portuguese).
[3] P. J. C. Cunha, Arquitetura de Hardware Reconfigurável de um Circuito Dedicado para a Realização de Interpolação de Imagens. M.Sc. dissertation, Graduate Program on Electrical Engineering, PUC Minas, 2009 (in Portuguese).
[4] ITU Recommendation BT.601-6 (01/07), http://www.itu.int, 2007.
[5] K. Jack, Video Demystified: A Handbook for the Digital Engineer, LLH Technology Publishing, Eagle Rock, VA, USA, ISBN 1-878707-56-6, 2007.
[6] Terasic Technologies, DE2-70 User Manual, http://www.altera.com/, 2007.
[7] Analog Devices, 10-Bit, 4x Oversampling SDTV Video Decoder ADV7180, http://www.analog.com, 2007.
[8] Analog Devices, CMOS, 240 MHz Triple 10-Bit High Speed Video DAC ADV7123, http://www.analog.com, 1998.
[9] V. A. Pedroni, Circuit Design with VHDL, MIT Press, Cambridge, MA, USA, ISBN 0-262-16224-5, 2008.
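As a closing illustration, the one-dimensional convolution of equation (5) in Section IV can be modeled behaviorally in Python. The handling of the samples at the line borders (edge replication) is an assumption, since the paper does not specify it, and integer floor division stands in for the hardware division by 256.

```python
def convolve_1d(line, coef0, coef1, coef2):
    """Equation (5): S'_i = (coef0*S_{i-1} + coef1*S_i + coef2*S_{i+1}) / 256,
    applied to one component (R, G or B) of one video line."""
    out = []
    n = len(line)
    for i in range(n):
        left = line[i - 1] if i > 0 else line[i]        # edge replication
        right = line[i + 1] if i < n - 1 else line[i]   # (assumed behavior)
        s = (coef0 * left + coef1 * line[i] + coef2 * right) // 256
        out.append(max(0, min(255, s)))                 # clamp to 8 bits
    return out
```

With the identity coefficients (0, 256, 0) the line passes through unchanged, while a low-pass kernel such as (64, 128, 64) leaves a constant line unchanged and smooths transitions.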