An FPGA Platform for Demonstrating Embedded Vision Systems. Ariana Eisenstein


An FPGA Platform for Demonstrating Embedded Vision Systems

by Ariana Eisenstein

B.S., Massachusetts Institute of Technology (2015)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Electrical Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 2016.

© Massachusetts Institute of Technology 2016. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 20, 2016

Certified by: Vivienne Sze, Emanuel E. Landsman (1958) Career Development Professor, Thesis Supervisor

Accepted by: Dr. Christopher J. Terman, Chairman, Department Committee on Graduate Theses


An FPGA Platform for Demonstrating Embedded Vision Systems by Ariana Eisenstein. Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2016, in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Electrical Engineering.

Abstract

This thesis presents an FPGA platform that can be used to enable real-time embedded vision systems, specifically object detection. Interfaces are built between the FPGA and a high definition (1920 x 1080) HDMI camera, an off-chip DRAM, an FMC connector, an SD Card, and an HDMI display. The interface processing includes debayering for the camera input, arbitration for the DRAM, and object annotation for the display. The platform must also handle the different clock frequencies of the various interfaces. Real-time object detection at 30 frames per second is demonstrated by either connecting the platform to an object detection ASIC via the FMC connector, or directly implementing the object detection RTL on the FPGA. Using this platform, ASICs developed in the Energy-Efficient Multimedia Systems lab can be verified and benchmarked on both live video via the HDMI camera as well as pre-recorded media via an SD Card. Finally, a post-processing filter has been implemented on the FPGA to reduce false positives and interpolate missed object detections by leveraging temporal correlations.

Thesis Supervisor: Vivienne Sze
Title: Emanuel E. Landsman (1958) Career Development Professor


Acknowledgments

I would like to thank my supervisor Professor Vivienne Sze for allowing me the opportunity to work with the Energy Efficient Multimedia Systems Group and develop this thesis. Thank you for the support and knowledge, which greatly enhanced my problem solving and design understanding. Thank you to Amr Suleiman for being a continuous source of advice and guidance on this project. Without you this thesis would not have been possible; good luck in your future projects. Thank you to my friends at MIT for being there for me through the ups and the downs. I would not have made it without you. Finally, thanks to my family for their continued support in all my goals.


Contents

1 Introduction
  1.1 Motivations for Hardware Accelerated Computer Vision Algorithms
  1.2 Discussions of Computer Vision Algorithms Implemented in System
  1.3 Overall System Design
  1.4 Previous Work - SuperUROP
      Camera Preprocessing
      Memory Interface and Management
      Detector Display
2 DRAM Architecture and Arbitration
  Arbitration Architecture
  Memory Interface Module
      States
      Control Signals
      Response Labeling
      Write Phase and Input
      Display Read Phase and Output
      Detector Read Phase and Output
      FIFO Read and Memory Controller Communication
  Addressing
  Time Multiplexed Data
3 Temporal Filtering
  3.1 Algorithm Exploration and Selection
      3.1.1 Previous Work
      3.1.2 Base Algorithm and Exploration
      Performance Measurement
      Position Correlation
      Detection Score Thresholding
      Multiple Frame Correlation
      Forward Backward Correlation
      Results
  RTL Implementation
      Box Loading and Selection
      Correlation Calculation and Threshold Comparison
      Serialization and Modified NMS
      Parametrization
  Overall Performance
4 Pre-Recorded Data Load
  On-FPGA ROM
      Static Image Storage
  SD Card
      SD Card Control
      Single Frame Load
      Video Load
5 Conclusion
  Contribution
  Future Work
      Tracking
      PC Connection
  Summary
A Conversions and Control Signals
  A.1 User Interface Controller Signals
  A.2 Detector Output to Pixel Conversions

List of Figures

1-1 System Block Diagram, showing the movement of data from the camera input to the display output. An additional external input of SD Card data is shown. Boundaries are drawn to show the separate clocking domains within the FPGA system.
1-2 Camera Pipeline Architecture, showing the data flow from the camera input to the memory interface, as well as the location of the BRAM memory.
1-3 a: Bayer Filter Pattern, b: Possible Three-by-Three Matrices; each of the matrices can be mapped to a three-by-three location on the Bayer pattern.
1-4 a: Down-sampled Grayscale Image, b: Grayscale image after up-sampling; after interpolating the camera data, the effects visible in (a) are no longer present.
1-5 Organization of data in frame from Vita 2000 image sensor [21].
1-6 Human Understandable Detector Results.
1-7 Detector Display module; note the serialization of multiple detector results.
1-8 a: All Visible Boxes, b: Consolidated Boxes; after using the Non-Maximal Suppression modules the overlapping boxes are consolidated into a single box around each person.
1-9 Ping Pong Architecture, demonstrating the architecture in which data is continuously flowing into one box calculation to be used, while the other calculation is fed to the display.
1-10 Final pixel drawing in Detector Display module; note the clock domains of each data path.
2-1 Memory Interface module Block Diagram showing the connection between the three modules corresponding to each external interface and the off-FPGA DRAM.
2-2 Memory Interface Module Block Diagram showing three external interfaces as well as the connection to DRAM. The camera clock is 63 MHz, the display clock is 150 MHz, the memory clock is 200 MHz, and the DDR3 clock is 800 MHz. The detector clock depends on the detector being tested.
2-3 State machine showing the movement between phases of the arbitration.
2-4 UI memory controller and interface with external DRAM [32].
3-1 Diagram showing the forward backward correlation scheme; the red lines indicate correlation between the current frame and previous two frames, the blue lines indicate correlations between the future frame and previous two frames.
3-2 Comparison of results, highlighting false positive removal. Each image in a row shows the same frame. The first column is detection results, the second filtered results, and the third ground truth.
3-3 Comparison of results highlighting an interpolated frame. Each image in a row shows the same frame. The first column is detection results, the second filtered results, and the third ground truth.
3-4 Diagram showing the Average Precision curve of both the detected and filtered results for a single KITTI road video. For higher accuracy (recall) values, the filter demonstrates higher precision values.
3-5 High level block diagram showing the major filtering sub-modules. Pipelining occurs between each block.
3-6 Block diagram showing the pipelined correlation calculation and threshold comparison. Current box is used as an example.
3-7 Block diagram showing the pipelined correlation calculation and threshold comparison. Current box is used as an example.
3-8 Block diagram showing the pipelined correlation calculation and threshold comparison.
4-1 SD Card controller state machine for video reading.
4-2 SD Card controller state machine with PAUSE state included.
Arbitration state machine showing movement between phases of DRAM arbitration.


List of Tables

1.1 System Clocks
1.2 Module Utilization
2.1 Module to Memory Location
3.1 Filtering Test Results


Chapter 1

Introduction

Many systems such as surveillance, advanced driver assistance systems (ADAS), portable electronics, and robotics need to process visual data collected from the world around them quickly and correctly. As these systems become smaller and more mobile, the processing will need to consume less power in order to achieve a reasonable battery life. Accelerating computer vision algorithms in dedicated hardware (i.e. ASICs [Application Specific Integrated Circuits]) allows faster processing of video data with higher resolutions and frame rates, while using fewer power and memory resources than traditional software-based implementations.

The Energy Efficient Multimedia Systems (EEMS) Group at MIT has developed several ASICs which implement low-power object detection algorithms. As described in [26, 27], these detectors operate on high definition 1920 by 1080 video data at rates of 30-60 fps [frames per second]. However, in order for the detectors to meet these metrics, they must receive HD video input at those frame rates. Similarly, these detectors produce results at clock frequencies between 62.5 and 125 MHz. In order to demonstrate a chip's correctness, these results must be collected and converted into a human-understandable form. Thus, the challenge is creating a suitable system which can both provide data input to and collect output from the chip consistently, neither dropping data nor adding latency to the system.

This thesis presents a platform developed on the VC707 Development Board with a Virtex-7 FPGA for testing and demonstrating complete systems using object detection ASICs. This platform has two input configurations (Camera and SD Card). The camera provides live video input to demonstrate the detector's robustness on real-world content. The SD card transfers pre-recorded media, both videos and fixed frames, to the detector. These can be used to test the ASICs on test images as well as benchmark the implemented algorithms on standard datasets. The video frame input to the system is stored off-FPGA in the on-board DDR3 DRAM memory. By storing the inputs in DRAM, multiple external interfaces can access the data independently of the input method. Using the DRAM memory also decreases the utilization of FPGA resources in the overall system, allowing faster designs and quicker compile times. The development board platform communicates with the detector ASIC through an FMC connection. The detector operates on a frame-by-frame basis. The detector results are converted to boxes, which are drawn over the video data read from the DRAM. The final display is shown over the platform's one output configuration (HDMI Display). The final detector results are filtered to improve detector accuracy. This filter uses a correlation algorithm that compares the detector results over multiple frames. As the number of consecutive frames a detection appears in increases, so does the confidence of that detection.

Chapter 1 outlines current approaches to object detection, detailing the algorithms tested on this platform. This section further motivates the need for hardware-accelerated computer vision algorithms. Additionally, a description of the overall system as completed as part of the SuperUROP project is given. This description includes two major modules: the camera interface, including the preprocessing of the camera data, and the final results display, including both the collection of results and the interface to the HDMI display. Additionally, the interface to BRAM memory and the storing of fixed frames is discussed.
Chapter 2 describes the integration of the off-FPGA DRAM memory with the system. This chapter covers the complexities of designing the DRAM memory interface, including the choice of memory controller and the interfacing of

the modules that communicate with this memory. Additionally, the chapter outlines the challenges of time-multiplexing the DRAM's single port between three modules operating at different frequencies. The safeguards ensuring that no data is dropped and that the lowest latency is achieved are also discussed.

Chapter 3 describes the development of a filtering algorithm that correlates the detection results over multiple frames. The filtering is used both to remove false positives and to interpolate dropped detections that occur for one to two frames. This process removes errors caused by noisy data sent to the detector, making the detection more robust. Additionally, this chapter details the RTL implementation of this algorithm and a discussion of utilization versus performance gain.

Chapter 4 describes the SD Card interface, which is used to transfer pre-recorded data into the system. This interface allows more diverse testing of the ASICs without needing to recompile the system by providing functionality to load different fixed images and full videos into the system post-compile, rather than encoding fixed data into a ROM pre-compile.

1.1 Motivations for Hardware Accelerated Computer Vision Algorithms

Many previous software implementations of computer vision algorithms run primarily on CPU/GPU platforms [4, 10]. The advantage of these systems is standardized camera and display management that is simple to implement and use. The disadvantage is the CPU/GPU's large power consumption and area utilization. For example, the Nvidia 8800 GTX uses 185W and takes up an area of 480 mm² at 90nm CMOS [19]. In addition, due to the processing times of these systems, they cannot operate at high resolutions. The CPU and GPU implementations proposed by R. Benenson et al. only work at a resolution of 640 x 480 despite running at high frame rates [2]. Recently, several vision systems have been developed for the FPGA to achieve

higher throughput. These systems tend to use large amounts of on-chip memory, leading to higher hardware costs. Even the lower end of these implementations, [28], uses 1.22 Mbit of on-chip memory. Additionally, many systems with lower memory costs only perform single-scale detection. Single scale means that detection can only occur at a single, fixed distance from the camera; people at distances different from the specified one would not be detected. To operate on real-world data, these systems require customized input and output connections in order to achieve the stated performance.

1.2 Discussions of Computer Vision Algorithms Implemented in System

The implementations discussed above use a variety of algorithms. Object detection algorithms have ranged from using cascades of simple features [29] to using edges and chamfer distances [15] to determine the presence of an object in the frame. The Histogram of Oriented Gradients (HOG) method was proposed by [8] and relies on invariant object features, mapped to a set of nine directed gradients, for detection. A detection score is generated using the dot product of the features and classifier weights. The HOG algorithm can be modified for higher performance by using a Deformable Parts Model (DPM), as demonstrated by Felzenszwalb et al. in [12]. DPM creates collections of parts which correspond to pieces of the object. These parts are connected by spring-like connections allowing for deformation. In [13], DPM is shown to increase detection accuracy over single-component models such as HOG.

Hardware-accelerated implementations of the HOG and DPM algorithms, including ASIC and FPGA, have been tested with this platform. Both implementations process high definition 1920 by 1080 video data. The HOG object detection implementation, developed by Suleiman and Sze in [26], supports multi-scale objects, meaning objects at a variety of distances can be detected.
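The HOG scoring idea described above (gradient magnitudes binned into nine orientations, then a dot product with trained classifier weights) can be illustrated with a small software sketch. The binning convention and the function names here are assumptions for illustration only, not the implementation of [26]:

```python
def hog_cell_histogram(mags, angles, bins=9):
    """Accumulate gradient magnitudes of one cell into 9 unsigned-orientation
    bins spanning 0-180 degrees (bin width 180/9 = 20 degrees)."""
    hist = [0.0] * bins
    for m, a in zip(mags, angles):
        # Fold the angle into [0, 180) and index the 20-degree bin it falls in.
        hist[int((a % 180.0) / (180.0 / bins)) % bins] += m
    return hist

def detection_score(features, weights, bias=0.0):
    """Linear-classifier detection score: dot product of the HOG feature
    vector with the classifier weights, plus a bias term."""
    return sum(f * w for f, w in zip(features, weights)) + bias
```

A window is declared a detection when its score exceeds a threshold; multi-scale support comes from repeating this scoring over a pyramid of downscaled frames.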
It achieves an average precision (AP) of 36.9% on the INRIA person data set [16], which is comparable to the original HOG AP of 40% [8]. For a 45nm SOI

process, the HOG implementation reaches 60 fps based on post-layout results. RTL for this HOG design was implemented directly on the FPGA. The DPM object detection implementation, developed by Suleiman, Zhang, and Sze in [27], supports multi-scale objects that are tolerant to deformation. It achieves an AP of over 80% on the INRIA person data set, and an AP of 26% on the PASCAL VOC 2007 object data set [11]. The DPM implementation reaches 30 fps in a 65nm process and was tested as an ASIC connected to the platform.

1.3 Overall System Design

Figure 1-1: System Block Diagram, showing the movement of data from the camera input to the display output. An additional external input of SD Card data is shown. Boundaries are drawn to show the separate clocking domains within the FPGA system.

The goals for the SuperUROP project were to create a platform connecting the ASIC-based detector to the physical world. As discussed above, this platform was built on the VC707 development board with a Virtex-7 FPGA. The external interfaces to the board (Camera 1, HDMI Display 2, SD Card 3, and Detector) transfer different data types at different frequencies. The challenge in creating this platform is converting the data from each interface to an intermediate type that can be stored off-FPGA and used by the other modules of the system. The high-level block diagram of the overall system is shown in Figure 1-1.

The architecture is divided into four major modules: Camera Preprocessing, Memory Interface, Detector Display, and SD Card. These will be referred to as the camera module, memory module, display module, and SD module respectively. The Detector will be referred to as the detector module. When discussing the processes completed in each module and its associated sub-modules, a frame signifies a single frame from the video feed; rows signify horizontal rows of pixels; columns signify vertical columns of pixels; and pixels signify the pixels in the individual frame. The clock for each module is defined within the section discussing that module; the frequencies are shown in Table 1.1. Within the main modules, each individual sub-module is pipelined to process data at the highest throughput possible with the least latency, given its input and output constraints. Primarily, all sub-modules of a given module operate in the same clock domain. Asynchronous FIFOs are used to communicate between modules as well as to add robustness and modularity to the design. As shown in Figure 1-1, the display module performs processing in both clock domains. A double-registering process is used to cross between clock domains in this module.

1 Initial implementation by José E. Cruz Serrallés.
2 Initial implementation by José E. Cruz Serrallés.
3 Initial implementation by Luis A. Fernández Lara.

Table 1.1: System Clocks

Clock       Frequency (MHz)
System      200
Camera      63
Detector    Varies based on the detector
Display     150
SD          100

Clocks used within the system. The Detector clock depends on the implementation being used; the HOG implementation and the DPM ASIC use clock frequencies in the 62.5 to 125 MHz range.

Table 1.2: Module Utilization

Component             LUT            Reg            BRAM           DSP
System Total          (18.42%)       (6.81%)        (24.51%)       280 (10.00%)
Camera Preprocessing  2172 (0.71%)   2162 (0.36%)   2 (0.19%)      0 (0%)
Detector Display      (6.44%)        6564 (1.08%)   0 (0%)         124 (4.43%)
Memory Interface      706 (0.23%)    2023 (0.33%)   186 (18.06%)   0 (0%)
Memory Controller*    (3.54%)        7787 (1.28%)   2 (0.19%)      0 (0%)
Filtering             (4.70%)        (2.25%)        3 (0.23%)      73 (2.61%)
SD Card               463 (0.15%)    1011 (0.16%)   12 (1.17%)     0 (0%)

Overall report of utilization of the components of the system. Percentage of the total VC707 FPGA resources is shown in parentheses. Filtering is not included in Detector Display. *IP core provided by Xilinx.

The total utilization of the current system, as well as details of significant modules, is shown in Table 1.2 with a reference to the total available utilization of the Virtex-7

FPGA. When designing this system, minimization of on-FPGA BRAM was prioritized. BRAM space on the FPGA is limited, and high on-FPGA BRAM utilization leads to longer routing between components, which increases latency and compile time. The Virtex-7 is a large FPGA, so minimizing the utilization of LUTs and registers was not prioritized.

The system input is data from a Vita 2000 1920 by 1080 pixel camera operating at 92 frames per second. The camera data is sent to the FPGA platform over an FMC connection. The camera preprocessing sub-modules transform the data into usable data for the detector and display modules. An additional input mode transfers data stored on an SD card and serializes the data into usable data for the detector and display modules. As shown in Figure 1-1, the processed data is sent to the memory module to be stored in off-chip DRAM. Both the detector module, described in Section 1.2, and the display module read pixels from the DRAM via the memory module. All interfaces to the DRAM store pixel data as 16 24-bit pixel values per 512-bit word. This use of DRAM allows for a modular and pipelined design: the modules that read and write from the DRAM can be changed as long as the correct pixel representation is maintained at the input and output of the memory module. Using this design, switching between camera and SD card input, or between different detector chips, only requires changing the inputs to the DRAM module pre-compile.

The detector module outputs coordinates and sizes corresponding to the regions of the frame where an object is detected. These results are transferred to the display module. The display sub-modules consolidate the results and transform them into pixel coordinates. These pixels are drawn over the real-time video feed and displayed on an HDMI high definition display.

1.4 Previous Work - SuperUROP 2015

The following subsections discuss the design of the Camera Preprocessing module, Memory Interface module, and Detector Display module.
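The 16-pixels-per-512-bit-word packing used by all DRAM interfaces can be modeled in software. The exact bit ordering inside the word is not specified here, so this sketch assumes 32-bit lanes (24 color bits plus 8 unused bits per pixel) with pixel 0 in the least significant lane; the function names are illustrative:

```python
def pack_pixels(pixels):
    """Pack 16 (r, g, b) 8-bit triples into one 512-bit DRAM word.

    Layout is an assumption: pixel i occupies bits [32*i + 23 : 32*i];
    the top 8 bits of each 32-bit lane are left zero.
    """
    assert len(pixels) == 16
    word = 0
    for i, (r, g, b) in enumerate(pixels):
        word |= ((r << 16) | (g << 8) | b) << (32 * i)
    return word

def unpack_pixels(word):
    """Inverse of pack_pixels: recover the 16 (r, g, b) triples."""
    out = []
    for i in range(16):
        lane = (word >> (32 * i)) & 0xFFFFFF
        out.append(((lane >> 16) & 0xFF, (lane >> 8) & 0xFF, lane & 0xFF))
    return out
```

Any producer (camera, SD card) or consumer (detector, display) that honors this packing can be swapped in without changing the memory module, which is the modularity the text describes.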
These modules were

completed as part of the SuperUROP project.

Camera Preprocessing

The Camera Preprocessing module transforms the camera data into the 16-pixel (24 bits per pixel) full-color representation stored in the DRAM. Each pixel stores 8 bits each of red, green, and blue color data. The camera has a maximum frame rate of 92 frames per second and outputs data in a Bayer filter pattern (Figure 1-3.a). In order to maintain a high throughput, the preprocessing could not introduce significant latency into the system; thus each module is pipelined for maximum efficiency. The block diagram for the sub-modules in the camera module is shown in Figure 1-2. Each module discussed in this section operates at the camera clock frequency (Table 1.1). The total utilization of the camera module is given in Table 1.2.

Figure 1-2: Camera Pipeline Architecture, showing the data flow from the camera input to the memory interface, as well as the location of the BRAM memory.

While the Bayer pattern allows the camera data to be down-sampled and represented with fewer bytes, the gradient image calculation used by the detector cannot be computed correctly with the down-sampled image. Without any preprocessing, the down-sampled image converted to grayscale appears as shown in Figure 1-4.a. This image breaks up the person's outline, and the object detector does not work properly. The image is therefore up-sampled using color interpolation, where full 24-bit red, green, and blue (RGB) values are generated for each pixel. This interpolation processes 16 pixels in parallel at a rate of 3 cycles per interpolation. The module which completes these operations is labeled Debayer in Figure 1-2.
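The per-pixel averaging that the Debayer module performs can be sketched in software. This is a simplified model, not the RTL: it assumes an RGGB pattern phase (the actual phase is an assumption), handles only non-border pixels, and the function name is illustrative:

```python
def debayer_pixel(raw, row, col):
    """Interpolate a full (r, g, b) value at (row, col) of a Bayer-mosaic
    frame by averaging the same-color samples inside the 3x3 window."""
    def color_at(r, c):
        # Assumed RGGB phase: even row/even col = R, odd/odd = B, else G.
        if r % 2 == 0 and c % 2 == 0:
            return "r"
        if r % 2 == 1 and c % 2 == 1:
            return "b"
        return "g"

    sums = {"r": [0, 0], "g": [0, 0], "b": [0, 0]}  # channel -> [total, count]
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            ch = color_at(row + dr, col + dc)
            sums[ch][0] += raw[row + dr][col + dc]
            sums[ch][1] += 1
    return tuple(sums[ch][0] // sums[ch][1] for ch in ("r", "g", "b"))
```

In hardware the same windowing is achieved by buffering three frame rows in BRAM so that the rows above and below the current pixel are available, as described in the next subsection.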

RGB Interpolation from Bayer Filter Pattern

The interpolation of the 24-bit RGB pixel value is performed by averaging the red, green, and blue values in the three-by-three matrix surrounding each desired pixel, using the method and algebra described in [23]. Four different three-by-three configurations are possible; each one is shown in Figure 1-3.b. The selection among these matrices depends on both the column and row location of the pixel within the frame. These locations are tracked by incrementing counters for each pixel entering the camera preprocessing module. The averaging calculation is completed in a single clock cycle.

Figure 1-3: a: Bayer Filter Pattern, b: Possible Three-by-Three Matrices; each of the matrices can be mapped to a three-by-three location on the Bayer pattern. Data must be used sequentially or will incur a memory cost.

One of the challenges of this interpolation is generating the surrounding matrix for each pixel. The averaging at each pixel requires pixels both sequentially before and sequentially after the center pixel to complete the matrix. A Block RAM (BRAM) stores three rows of the frame and is pipelined to continuously update as pixels move into the camera preprocessing module. Each BRAM location stores sixteen pixels as represented by the Bayer filter pattern. Therefore, this BRAM has width 128 (16 pixels, 8 bits per pixel) and depth 360 (the number of groups of 16 pixels in three frame rows). To calculate the interpolation for the 16 pixels at a given address a, the pixels at location a in the BRAM are first read and stored in registers. These represent the center row of the

interpolation matrix. The locations holding the other two stored rows, a + 120 and a + 240 (one frame row is 1920/16 = 120 BRAM words), are read in the next two cycles, representing the bottom and top rows of the interpolation matrix respectively. An address wraps if it exceeds 360, so the row being read will wrap. The data output is delayed by a single clock cycle to allow processing of the rightmost pixel in the word of 16 pixels. Each pixel word is thus processed in four cycles: three clock cycles to read each row and one cycle of delay. Thus, color interpolation has a throughput of sixteen pixels every four clock cycles, or 12 bytes per cycle.

Figure 1-4: a: Down-sampled Grayscale Image, b: Grayscale image after up-sampling; after interpolating the camera data, the effects visible in (a) are no longer present.

When performing the interpolation, the corner cases on the top and left side of the frame require special treatment. On the left side of the frame, at the beginning of each row, values cannot be interpolated until the final BRAM memory access. This delay caused missing and incorrect values for the first two columns in each row. In order to fill in the missing leftmost color values, a special interpolator capable of operating on a smaller 2x3 matrix was created and used to generate the missing pixels, along with logic to send the necessary data to these interpolators. To ensure the camera data is stored in the correct location in memory, the reset signals of both the Debayer BRAM and the memory are synced to the camera frame-start signal. The final result is the up-sampled image shown in Figure 1-4.b, without the noise effects. The Debayer module is pipelined such that it reads data from the BRAM while

sending data to be written to the memory, generating the interface addresses for both simultaneously, as shown in Figure 1-2. Thus, the total latency on the data path from the camera to the memory module is 7 cycles. This delay is absorbed by the memory module, and the full frame appears correctly.

Additionally, if further data compression is necessary, we convert the RGB data to the YUV colorspace. The detector uses grayscale images, and luminance (Y) values give more accurate grayscale values than simply averaging the red, green, and blue values. By discarding the color components (U, V), the amount of data moving through the system is reduced by 1/3.

RGB to YUV Conversion

The RGB to YUV conversion is computed using the full conversion matrix and the following calculations:

(Y, U, V)^T = C (R, G, B)^T    (1.1)

Yt = (Y + 128) >> 8
Ut = (U + 128) >> 8    (1.2)
Vt = (V + 128) >> 8

Yu = Yt
Uu = Ut + 128    (1.3)
Vu = Vt + 128

The matrix C in Equation 1.1 uses whole-number coefficients to simplify computation in hardware. Equation 1.2 scales the values to 8-bit integers with rounding. Additional offsets are added in Equation 1.3 to eliminate negative values, further decreasing the hardware complexity of the system. This sub-module has a throughput of 16 pixels per cycle. Thus, the total latency of the camera pipeline is increased to 8 cycles, and the throughput decreases from 12 bytes per cycle to 8 bytes per cycle.

Figure 1-5: Organization of data in frame from Vita 2000 image sensor [21]

The camera outputs dummy data at the end of each row and the end of each frame, while the camera is syncing [21]. Figure 1-5 details the pattern of this dummy data. This dummy data is not marked as valid and is therefore not sent to the memory module.

Memory Interface and Management

The SuperUROP project implemented the system frame storage with the on-FPGA Block RAM (BRAM). The Virtex-7 has about 9MB of BRAM on-chip [33]. In practice, less than 5MB of BRAM could be used in this system due to routing delays caused by high FPGA utilization. The system requires 4MB of BRAM to store two grayscale frames of 2MB each. Two frames must be stored because the BRAM IP cores provided by Xilinx only support up to two ports. One frame buffer is read by the detector at the detector clock frequency; one frame buffer is read by the display at

the display clock frequency (Table 1.1).

Figure 1-6: Human Understandable Detector Results

As mentioned above, a 2MB frame buffer can only store a grayscale frame. Each BRAM location stores 16 Y values of 8 bits each. Full-color frames require 6MB of on-FPGA BRAM, which is beyond the FPGA availability of this system. Thus, in this configuration only grayscale frames can be shown on the final display.

Detector Display

The detector display module transforms the detection results into a human-understandable format, as shown in Figure 1-6. The block diagram of the sub-modules contained in the display module is shown in Figure 1-7. The results are converted from a box representation to pixel coordinates. These pixels move from the detector clock domain to the display clock domain to be drawn on the frame output. The final results are displayed in real time, providing a more convincing result than simulations

alone.

Figure 1-7: Detector Display module; note the serialization of multiple detector results.

The display module's inputs are the object box representation generated by the detector and the frame stored in the off-FPGA DRAM. Multiple detection results can be produced in the same cycle to increase the throughput of multi-scale detection. The outputs of these detectors are serialized so that each set of results can be processed separately and the display is robust to different detectors. The detector data is converted into pixel data that reflects the size and location of the detected boxes in the frame in pixel coordinates. This process is described in detail in Appendix A.2 and takes a single clock cycle operating at the detector clock frequency (Table 1.1). The boxes are then consolidated using a non-maximal suppression (NMS) algorithm.

Figure 1-8: a: All Visible Boxes, b: Consolidated Boxes; after using the Non-Maximal Suppression modules the overlapping boxes are consolidated into a single box around each person.

The NMS algorithm used in the architecture compares two boxes at a time to

determine overlap. An algorithm that performs a local maximum search to isolate the areas of interest, NMS has multiple applications in computer vision, as discussed by [18]. This system applies NMS in the same manner as [8] to isolate object instances. Without this algorithm, many overlapping boxes would be drawn over the video feed, as shown in Figure 1-8.a.

Non-Maximal Suppression

The Non-Maximal Suppression (NMS) sub-module collects results associated with the same object for an enhanced final display. A naive approach consolidates any boxes that overlap. This method does not work in this implementation, as objects in close proximity may cause overlapping boxes. An example of this behavior is visible in Figure 1-8.b. To account for this effect, the area of the intersection of two boxes and the area of the union of two boxes are calculated. These values form a ratio, intersection over union, which is compared to a threshold to determine whether the boxes are associated with the same object. In this system, the threshold is 1/2. The calculation for two boxes, b1 and b2, is shown below:

iw = max(b1.x0, b2.x0) − min(b1.x1, b2.x1)
ih = max(b1.y0, b2.y0) − min(b1.y1, b2.y1)
uw = min(b1.x0, b2.x0) − max(b1.x1, b2.x1)
uh = min(b1.y0, b2.y0) − max(b1.y1, b2.y1)
ib = iw * ih
ua = uw * uh
ua < (1/threshold) * ib    (1.4)

where ua is the union box area and ib is the intersection box area. The calculation in Equation 1.4 is computed using only multiplication and comparison, avoiding costly division. Additionally, any box completely contained by another box is immediately represented by the larger box. These calculations take a single clock cycle to complete.

In the current implementation, thirty individual person boxes can be displayed in a single frame. This value is parametrizable and can be adjusted pre-compilation to suit the needs of the application. For the remainder of this section, thirty will be used as the number of boxes. The comparison of a new box with the stored boxes has two possible results. First, the box overlaps with a stored box; in this case, the incoming box is averaged with the stored box, and the average is stored. Second, the box does not overlap with any stored box. If all thirty displayed boxes are filled with an object box, the incoming box is unused; otherwise, it is added to the collection of thirty displayed boxes. The set of thirty boxes is continuously updated during each frame.

A ping-pong architecture is used to parallelize box calculation and box display. This architecture continuously calculates one set of stored boxes while a different set of stored boxes is displayed. Using this method causes a single frame delay between the results from the detector and the results being displayed. The final results are overlaid on each corresponding frame as shown in Figure 1-8.b.

Ping-Pong Architecture

In order to maintain a correct and consistent display, the boxes drawn to the frame are fully updated before the frame is displayed, using a ping-pong architecture. A block diagram of this architecture is shown in Figure 1-9. After the detector finishes processing a frame, the boxes that were being calculated are displayed, and a new set of boxes is calculated. Thus, the calculated boxes are not shown on screen until a full frame has been processed. Since the updating process and the display process happen simultaneously, there is only a single frame delay between the information from the detector and the information being displayed.
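The box consolidation described above (the Equation 1.4 overlap test plus the thirty-slot box store) can be sketched in software as follows. This is a minimal behavioral model with hypothetical names, not the RTL: the hardware evaluates the test with multipliers and comparators in a single cycle, and the sign conventions here are simplified relative to the printed equations.

```python
def overlaps(b1, b2, inv_threshold=2):
    """Equation 1.4 test: boxes are associated when intersection over
    union exceeds 1/2, computed as ua < 2 * ib to avoid division.
    Boxes are (x0, y0, x1, y1) tuples."""
    iw = min(b1[2], b2[2]) - max(b1[0], b2[0])  # intersection width
    ih = min(b1[3], b2[3]) - max(b1[1], b2[1])  # intersection height
    if iw <= 0 or ih <= 0:
        return False                            # boxes are disjoint
    uw = max(b1[2], b2[2]) - min(b1[0], b2[0])  # union (bounding) width
    uh = max(b1[3], b2[3]) - min(b1[1], b2[1])  # union (bounding) height
    ib = iw * ih
    ua = uw * uh
    return ua < inv_threshold * ib

def consolidate(stored, new_box, max_boxes=30):
    """Average a new box into a matching stored box, or append it when a
    slot is free (a model of the thirty-slot register store)."""
    for i, s in enumerate(stored):
        if overlaps(s, new_box):
            stored[i] = tuple((a + b) // 2 for a, b in zip(s, new_box))
            return stored
    if len(stored) < max_boxes:
        stored.append(new_box)
    return stored
```

The integer averaging mirrors the fact that the hardware can realize the merge with an adder and a one-bit shift rather than a divider.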

Figure 1-9: Ping-Pong Architecture, demonstrating the architecture in which data continuously flows into one box calculation while the other calculation is fed to the display.

The final box results are transformed into a set of pixel locations. A single color is drawn at each of these locations in the frame to display the boxes. This task requires matching the pixel locations to the correct locations in each frame. The data flow is shown in Figure 1-10. If the location contains a colored pixel, the colored pixel is displayed; otherwise, the frame pixel is displayed. The comparison is challenging because the colored pixel calculation is performed at the detector clock frequency while the frame pixels are checked at the display clock frequency. The modules described operate on the detector clock, and the final box pixel values are converted to the display clock domain through double-register synchronization. Additionally, false timing paths are declared between the two clocks to satisfy the system timing constraints.

Figure 1-10: Final pixel drawing in Detector Display module; note the clock domains of each data path. The | operator represents bitwise OR; the boolean will be False if the box pixel has not been assigned a value.
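A behavioral sketch of the ping-pong buffering and the final pixel mux might look like the following. Class and function names are hypothetical; in hardware the two buffers are register files whose roles swap at each frame boundary, not Python lists.

```python
class PingPong:
    """Two box buffers: one is updated with detector results while the
    other is read by the display; roles swap at each frame boundary."""
    def __init__(self):
        self.bufs = [[], []]
        self.calc = 0  # index of the buffer currently being written

    def frame_done(self):
        self.calc ^= 1             # swap the calculate/display roles
        self.bufs[self.calc] = []  # start a fresh calculation set

    def write_box(self, box):
        self.bufs[self.calc].append(box)

    def display_boxes(self):
        return self.bufs[self.calc ^ 1]

def output_pixel(frame_pixel, box_pixel):
    """Final mux of Figure 1-10: draw the box colour when the box pixel
    has been assigned a (non-zero) value, else the frame pixel."""
    return box_pixel if box_pixel != 0 else frame_pixel
```

The one-frame delay between detection and display falls directly out of the swap: a box written during frame N is only visible via `display_boxes()` after `frame_done()` for frame N.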

Chapter 2

DRAM Architecture and Arbitration

Off-chip DDR3 DRAM is used to decrease the utilization of the on-FPGA memory. When using the DRAM, the system can store and read multiple frames of data, as the DRAM has 1 GB of memory available, in comparison to the 9 MB of on-FPGA BRAM. Although DRAM is bigger and can store more data than BRAM, DRAM read requests have more latency, and streaming data is more difficult. Removing the BRAM utilization decreases the overall FPGA utilization significantly. Lower FPGA utilization allows shorter routes between components and shorter compile times. Shorter routes decrease the routing delays between components, allowing the system to be clocked faster. The full utilization of the DRAM Memory Interface is shown in Table 1.2. The memory module's utilization is dominated by the IP core provided by Xilinx.

The Memory Interface module directs data from the other modules in the system to the underlying Xilinx Memory Controller. A block diagram of the placement of the Memory Interface module in the system is shown in Figure 2-1. This module is designed with several layers of abstraction, to be robust to changes to the interfaces that send it requests. The DRAM is a single-port device; as such, only one block can send a request to the DRAM each cycle. Thus, the memory module has an arbitration scheme with three phases of operation: write phase, detector read phase, and display read phase.

Figure 2-1: Memory Interface module block diagram showing the connection between the three modules corresponding to each external interface and the off-FPGA DRAM.

2.1 Arbitration

The camera, detector, and display blocks must each access the frame data stored in the system. Each of these blocks has a different bandwidth; as such, the DRAM acts as a buffer between these blocks. The DRAM is a single-port device, so only one block can send a request to the DRAM each cycle. As the system contains three different modules that must communicate with the DRAM, a form of arbitration is necessary to determine which block's requests are sent to the DRAM during each clock cycle. The memory interface module has three phases of operation: write phase, detector read phase, and display read phase. The phase with the highest bandwidth is given the highest priority; the next highest bandwidth is given the next priority; and the lowest bandwidth is given the lowest priority. Bandwidth is measured in MBPS (megabytes per second) or GBPS (gigabytes per second). The camera has a bandwidth of 546 MBPS and the display has a bandwidth of

MBPS. The detector is assumed to have a bandwidth equal to or less than the display bandwidth, though the exact value depends on the detector ASIC being tested. Thus, the detector is assumed to have the lowest bandwidth in the system. As such, the priorities of DRAM accesses in the system are camera, display, and detector. The required bandwidth for the system is the sum of these values; the maximum required bandwidth is about 1300 MBPS. The bandwidth of the DDR3 on the VC707 development board is 12.5 GBPS [34], and thus this scheme is feasible.

Each module independently generates a stream of requests. The camera generates a stream of write requests; the display and detector expect streams of read responses. The arbitration satisfies the condition that none of these streams are stopped: the camera block can always write data, and data is always available for the display and detector blocks to read. In order to maintain this condition, the camera's write requests and the display's and detector's read responses are buffered. The memory interface module sends the camera write requests to DRAM after the number of stored write requests rises above a certain level. This behavior ensures that the camera can continue to make write requests without any request being dropped. The memory interface module sends display or detector read requests to the DRAM after the number of buffered read responses decreases below a certain threshold. This behavior ensures that both the display and detector consistently have data available to be read.

2.2 Architecture

The previous section gave a high-level overview of the arbitration required to time-multiplex between three blocks that operate with different bandwidths. The data from each of these blocks is stored in an asynchronous FIFO. This FIFO both converts the data between different clock domains and makes the memory module robust to the different clock frequencies possible for each interface.
The FIFOs are sized to be robust to the different module bandwidths without overflow or underflow. The memory interface module interfaces with the DRAM via a memory

controller provided by Xilinx. The system requires that a complete frame be transferred to the detector from the memory controller at 356 MBPS in order to maintain 60 fps performance in the detector and demonstrate real-time results. A block diagram of the architecture of the Memory Interface module is shown in Figure 2-2.

Figure 2-2: Memory Interface module block diagram showing the three external interfaces as well as the connection to DRAM. The camera clock is 63 MHz, the display clock is 150 MHz, the memory clock is 200 MHz, and the DDR3 clock is 800 MHz. The detector clock depends on the detector being tested.

2.2.1 Memory Interface Module States and Control Signals

The memory interface module can be in three different states: write phase, detector read phase, or display read phase, corresponding to the block whose requests are being sent to DRAM. Each phase requires different control signals to be sent to the FIFOs and DRAM controller to ensure that the content of each request is correct. The arbitration module assigns these control signals and operates at the Memory

Interface module bandwidth given in Section 2.1. A detailed discussion of the DRAM control signals is provided in Appendix A.1; this section will refer to read and write requests and read responses. A state machine showing the transitions between each phase is shown in Figure 2-3.

Figure 2-3: State machine showing the movement between phases of the arbitration.

In the write phase, the read enable signal for the camera FIFO goes high. The data from the FIFO is broken into an address and write data and is sent to the DRAM in a write request. The FIFO uses first-word fall-through, meaning data is available on the same cycle the read enable goes high. In the detector read phase and display read phase, a read request is sent to the DRAM using the internally generated detector read address or display read address. If a valid read response is available, the detector FIFO's or display FIFO's write enable signal goes high and the data is stored in the FIFO.
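The phase selection can be modeled in software as a simple priority function. This is a sketch, not the RTL: the 4000- and 600-entry occupancy thresholds follow the text of the next subsections, the function name is hypothetical, and the `idle` fallback is an assumption for the case where no FIFO needs service.

```python
HIGH = 4000  # occupancy threshold that triggers service of a FIFO
LOW = 600    # occupancy floor that guards against under/overflow

def next_phase(cam_fifo, disp_fifo, det_fifo):
    """Choose the arbitration phase for the next cycle (Figure 2-3 model).
    cam_fifo: pending camera write requests; disp_fifo / det_fifo:
    buffered read responses waiting for the display / detector."""
    if cam_fifo > HIGH:
        return "write"           # camera writes have the highest priority
    if disp_fifo < HIGH:
        return "display_read"    # keep the display response buffer filled
    if det_fifo < HIGH and disp_fifo > LOW:
        return "detector_read"   # detector served only when display is safe
    return "idle"                # assumption: nothing needs service
```

Because the arbitration runs much faster than any client, evaluating this priority every cycle is enough to keep the camera stream from dropping requests and the two read streams from starving.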

2.2.2 Response Labeling

A challenge in the arbitration module is determining to which block a response from DRAM should be sent. The latency between a read request and a read response is variable due to different DRAM access patterns; switching pages, switching between read and write requests, and refreshes of the DRAM all negatively impact throughput [32]. As such, the data cannot be directly linked to a specific request. In order to ensure correct functionality, a protocol counts the number of requests made by each module: the arbitration labels all responses as belonging to one module until the number of responses equals the number of requests that module previously sent.

2.2.3 Write Phase and Input

The camera data is addressed to a location in the frame in the camera module. Both this address and the associated data are sent by the camera block, and the concatenation of address and data is stored in an asynchronous FIFO. The memory interface module enters the write phase when this FIFO has over 4000 entries. As the arbitration module has a much higher bandwidth than the camera block, the number of entries in the FIFO decreases. The memory interface module will not read the camera FIFO if it contains fewer than 600 entries, because otherwise the FIFO would underflow.

2.2.4 Display Read Phase and Output

As discussed above, the camera data has the highest priority in the arbitration. As such, the memory interface module always switches to the write phase when the camera FIFO fills above its threshold. Otherwise, the memory interface module can enter the display read phase if the display FIFO contains fewer than 4000 entries, and it remains in the display read phase. This behavior maintains the priority of the camera write requests over the display read requests. As read responses are generated by the DRAM and labeled as the display block's, they are stored in the display FIFO. This FIFO is read by the display block once every 16 display clock cycles. As the arbitration module

generates the display read requests, the display FIFO fills at the bandwidth of the arbitration module and empties at the bandwidth of the display block. As the arbitration module has a much higher bandwidth than the display block, the display FIFO fills while the memory module is in the display read phase and will not underflow. Display read requests are only made while the memory interface module is in the display read phase; as such, the display FIFO does not overflow.

2.2.5 Detector Read Phase and Output

As with the display read phase, the memory interface module always switches to the write phase when the camera FIFO fills above its threshold. Otherwise, the memory module can enter the detector read phase if the detector FIFO has fewer than 4000 entries and the display FIFO has more than 600 entries. The memory module remains in the detector read phase unless the display FIFO empties below 600 entries, which causes a shift to the display read phase. This behavior maintains the priority of camera write requests and display read requests over detector read requests. As read responses are generated by the DRAM and labeled as the detector block's, they are stored in the detector FIFO. As the arbitration module generates the detector read requests, the detector FIFO fills at the bandwidth of the arbitration module and empties at the bandwidth of the detector module. As the arbitration module has a much higher bandwidth than the detector module, the detector FIFO fills while the memory interface module is in the detector read phase and will not underflow. Detector read requests are only made while the memory module is in the detector read phase; as such, the detector FIFO does not overflow.

2.2.6 FIFO Read and Memory Controller Communication

The DDR3 DRAM memory controller is generated using the Memory Interface Generator (MIG) IP provided by the Vivado development environment [32].
Specifically, the User-Interface (UI) controller [32] was selected due to its level of control of the DRAM signals. A diagram of this controller is shown in Figure 2-4. The UI controller's inputs are a request address, a mode of operation, and an enable signal. The controller's outputs are a data response and a valid bit. DRAM memory bursts and optimal address ordering are performed internally. Additionally, this controller converts data from the system clock domain (Table 1.1) to the DRAM frequency of 800 MHz. This memory controller directly drives the DRAM and is assumed to function according to Xilinx specifications.

Figure 2-4: UI memory controller and interface with external DRAM [32].

2.2.7 Addressing Time-Multiplexed Data

To ensure that no data is overwritten before the desired module accesses the memory location, the frames of video data are stored in different locations in the DRAM memory. After the system begins operation, three separate frames of video data are stored in sequential frame-sized locations in the DRAM. These locations are designated A, B, and C. Which module accesses each location is determined by an internal state machine. Initially, the camera module writes a frame of video data to location A. When the final detector address signals an arbitration reset, the state switches. Now,

the camera module writes a frame to memory location B, followed by the detector module reading a frame from location A. Again the state switches, and the camera module writes to location C, the detector module reads from location B, and the display module reads from location A. This pattern continues for each frame that is written to DRAM. Table 2.1 shows this access pattern. Using this pattern leads to a frame delay between the frame read by the detector and the frame read by the display. Additionally, the pattern prevents a highly pipelined design, which would allow reading of data as soon as it was written. This design prioritizes correctness over speed: updating locations only after all three modules have either read or written a complete frame ensures that each frame is complete and has not been modified while being read or written. Additionally, as the memory module operates at the system frequency, the highest frequency possible within the FPGA, the highest possible throughput for this scheme is achieved.

Table 2.1: Module to Memory Location

Module    State 1   State 2   State 3   State 1
Camera    A         B         C         A
Detector  C         A         B         C
Display   B         C         A         B

A table showing the location each module accesses in memory during each state. The state changes when the detector module finishes reading a complete frame. There are three states (1, 2, 3) in the state machine.
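The rotation of Table 2.1 can be expressed as a small address-lookup sketch. The frame footprint constant and the function name are assumptions for illustration; only the A/B/C rotation itself comes from the table.

```python
FRAME_BYTES = 1920 * 1080 * 3  # assumed per-frame footprint in DRAM

# Base addresses of the three sequential frame-sized regions.
BASE = {"A": 0, "B": FRAME_BYTES, "C": 2 * FRAME_BYTES}

# Table 2.1: which region each module touches in states 1, 2, 3.
ROTATION = {
    "camera":   ["A", "B", "C"],
    "detector": ["C", "A", "B"],
    "display":  ["B", "C", "A"],
}

def frame_base(module, state):
    """Base DRAM address used by `module` during arbitration state 1, 2,
    or 3; the states repeat cyclically, so state 4 behaves like state 1."""
    return BASE[ROTATION[module][(state - 1) % 3]]
```

Reading the rows, the camera always writes the region the detector will read one state later, and the detector always reads the region the display will read one state after that, which is the source of the one-frame delays noted above.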


Chapter 3

Temporal Filtering

Computer vision and object detection algorithms that operate on a frame-by-frame basis, such as the HOG and DPM algorithms developed by [26, 27], are highly susceptible to noise within the frame. This noise can create flashing effects: either a false positive that appears for one to two frames, or a true detection that is lost for one to two frames. In order to eliminate these noise effects, a correlation filter is implemented on the boxes produced by the NMS algorithm (Section 1.4). This filter correlates the boxes detected in the previous and future frames to both filter out flashing detections and interpolate dropped detections. This chapter begins with a literature review of relevant correlation algorithms, culminating in the selection of a specific implementation to explore. Experimentation with different parameters in software leads to the final selection of an algorithm to implement in hardware. This hardware implementation is then described and analyzed.

3.1 Algorithm Exploration and Selection

Multiple algorithmic methods exist for correlating data between sequential frames. This section first explores the algorithmic approaches to correlating object detections over multiple frames as presented in the literature. From these, a general class of algorithm is selected for exploration of performance and usability in the system. Increasingly complex iterations on this algorithm were implemented and tested in

Matlab. The algorithm with the best performance is selected for implementation in RTL.

3.1.1 Previous Work

Multiple algorithmic approaches are used to track objects over multiple frames of video data. This review focuses on algorithms that perform tracking by detection, to maintain independence of the detector and filter. Two major classes of tracking-by-detection algorithms exist: correlation between frames, as demonstrated by [1, 6, 24], and particle filters, as demonstrated by [5, 7, 20]. Particle filters require knowledge of the world coordinates, and therefore are not used in this implementation. Within the correlation algorithms, several approaches for calculating correlation are explored. Wu et al. propose the following factors for correlation in [30]:

A_pos(p1, p2) = γ_pos · exp(−(x1 − x2)² / σ_x²) · exp(−(y1 − y2)² / σ_y²)    (3.1)

A_size(s1, s2) = γ_size · exp(−(s1 − s2)² / σ_s²)    (3.2)

p1, s1 and p2, s2 represent the position and size of the first and second feature being compared, respectively. γ_pos, σ_x², σ_y², γ_size, and σ_s² are normalizing coefficients. These factors are used by Shu et al. in the following calculation of the affinity matrix M [25]:

M(i, j) = C(i, j) · A_pos(i, j) · A_size(i, j)    (3.3)

C(i, j) is the classifier comparison, which is a comparison of feature points. While these algorithms have good performance in software, achieving 71.4% detection precision and 73.5% detection accuracy on the Oxford Town Center Dataset [3], a hardware implementation of an exponential function would have less precision than a floating-point software implementation and would lead to high resource usage, preventing high levels of parallelism. A less computationally expensive calculation is

proposed by Segen and Pingali in [24]:

m = (dx² + dy²) / σ_x² + do² / σ_o² + dc² / σ_c²    (3.4)

where dx and dy are the differences in x and y position, do is the difference in orientation, and dc is the difference in curvature. σ_x², σ_o², and σ_c² are scaling factors. The algorithm achieves a tracking accuracy of 80% on a test sequence gathered by the authors [22]. The correlation calculation uses addition, multiplication, and division. Multiplication and addition are not costly in hardware, and as the division is by a constant, the algorithm can easily be converted to use a multiplicative normalizing factor. Thus, this algorithm is selected for performance analysis within the system.

An additional consideration when using a correlation algorithm is choosing how correlation increases detection confidence. Shu et al. propose the notion that detections must be present over a certain number of frames before they can be considered correct [25]. These detections are considered in correlations prior to this number of frames but are not drawn until they achieve a high enough certainty. Using this method, new objects can be initialized after being present for a certain number of frames. The number of frames is explored in the software exploration.

3.1.2 Base Algorithm and Exploration

To use the algorithm proposed by [24] in the developed system and with the detector outputs, several modifications must be made. First, the detector outputs the position and scale of each object detection. Thus, the orientation and curvature terms are replaced by scale terms, representing the size of the box, giving the following equation:

m = (dx² + dy²) / σ_x + (dx_scale² + dy_scale²) / σ_scale    (3.5)

where, with 1 being the first box of the comparison and 2 being the second,

dx = x1 − x2
dy = y1 − y2
dx_scale = xscale1 − xscale2
dy_scale = yscale1 − yscale2

The scaling factors were selected such that each term equals 1 when the values have the maximum difference: σ_x = … and σ_scale = (443 − 64)² + (…)², encompassing the squaring terms of Equation 3.4. Finally, correlation Equation 3.5 is modified to convert the divisions into multiplications, as shown in the final equation:

m · σ_x · σ_scale = σ_scale (dx² + dy²) + σ_x (dx_scale² + dy_scale²)    (3.6)

The threshold value for this comparison is 0.5 · 10^14 · σ_x · σ_scale and was determined empirically. As shown in [25], using additional information from the detection can increase the confidence of the correlation. The use of the detection score, a measure of detection confidence defined in Section 1.2, is explored in the software simulations. Also presented in [25] is the notion of the detection being correct only after appearing in a certain number of frames. This number and the weighting of frames are explored as well.

3.1.3 Performance Measurement

The software-based filtering algorithm's output performance was benchmarked against the KITTI Vision Road dataset [17]. The performance of each iteration of the algorithm was compared to the performance of the detection alone on the same data set. The two metrics considered are precision and recall, which are calculated using

the following equations:

precision = (number of correct detections) / (total boxes drawn)
recall = (number of correct detections) / (number of boxes in benchmark)

A box is counted as correct if it overlaps with a benchmark box by at least 50%. The results of running the detector alone on one video of the KITTI dataset are shown in Table 3.1. The ideal performance of the filter increases the precision while maintaining a nearly constant recall in comparison to the base detection algorithm.

3.1.4 Position Correlation

The simplest form of correlation is to compute the correlation between boxes in the current and previous frame. Each box in the current frame is compared to all boxes in the previous frame using Equation 3.6. The lowest value for each box in the current frame is its correlation score; the minimum score is zero, implying the same box occurred in both frames. If the correlation score associated with the box is below the threshold, the box is drawn. The results of running this algorithm are shown in Table 3.1. The precision does increase, but with a decrease in recall.

3.1.5 Detection Score Thresholding

In addition to thresholding on the correlation score, the detection scores of the current and previous boxes can be used as an additional threshold. For the boxes with the lowest correlation score, the two detection scores are averaged. As with the correlation score, if the average detection score is below a score threshold, the box from the current frame is drawn. A correlation cannot be computed between the current and previous scores, as the scores do not differ greatly between different objects. The results of running this algorithm are shown in Table 3.1. With detection score thresholding, there is a small precision increase and a greater decrease in recall. This effect confirms that the detection score does not differentiate between objects with

high accuracy.

3.1.6 Multiple Frame Correlation

The previous two algorithms only required the object to be present in a single previous frame for the object to be considered correct. As discussed in Section 3.1.1, more frames correspond to higher confidence. Multiple frame correlation compares the current frame to the two previous frames. In this implementation, a box is assumed to be correct (not noise) if it persists for more than two frames. The correlations to the two previous frames are computed by performing a single-frame correlation between the current frame and the previous frame, and a second single-frame correlation between the current frame and the frame before that. The results of these two correlations are combined by weighting the previous frame correlation by 0.75 and the correlation with the frame before that by 0.25. This combined value is used in the threshold comparison to determine whether the box is drawn. The results of running this algorithm are shown in Table 3.1. Multiple frame correlation creates a more significant precision increase than using a single frame for correlation, but a slightly greater decrease in recall than single-frame correlation.

3.1.7 Forward Backward Correlation

The previous algorithms are designed to eliminate noisy boxes from the detection. In order to interpolate boxes that have been dropped by the detection, correlations using a future frame are used. Thus, there are four correlations performed in total, as shown in Figure 3-1. First, all boxes in the current frame are checked, and the filtered boxes are stored. Then, the future frame's boxes are checked; if a detected box has a correlation score that passes the score thresholds and does not match any of the previously stored filtered boxes in the current frame, the box is added to the collection of filtered boxes.
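The weighted two-frame test and the forward-backward interpolation can be sketched together as follows. This is a software model with assumed values for σ_x and σ_scale and an arbitrary threshold; only the Equation 3.6 form, the 0.75/0.25 weights, and the forward-backward structure come from the text.

```python
SIGMA_X = 1920**2 + 1080**2  # assumed: maximum squared position difference
SIGMA_S = 400**2 + 300**2    # assumed: maximum squared scale difference

def corr(b1, b2):
    """Division-free correlation distance of Equation 3.6.
    Boxes are (x, y, xscale, yscale); smaller means better correlated."""
    dx, dy = b1[0] - b2[0], b1[1] - b2[1]
    dxs, dys = b1[2] - b2[2], b1[3] - b2[3]
    return SIGMA_S * (dx * dx + dy * dy) + SIGMA_X * (dxs * dxs + dys * dys)

def best_corr(box, frame):
    """Lowest correlation distance of `box` against every box in `frame`."""
    return min((corr(box, b) for b in frame), default=float("inf"))

def filter_boxes(current, prev1, prev2, future, threshold):
    """Forward-backward filtering over two past frames (Figure 3-1 model).
    Keep a current box when its 0.75/0.25-weighted distance to the two
    previous frames is under threshold; interpolate a future box when it
    correlates with the past but matches no kept current box."""
    kept = [b for b in current
            if 0.75 * best_corr(b, prev1) + 0.25 * best_corr(b, prev2)
            < threshold]
    for b in future:
        score = 0.75 * best_corr(b, prev1) + 0.25 * best_corr(b, prev2)
        if score < threshold and all(corr(b, k) >= threshold for k in kept):
            kept.append(b)
    return kept
```

A box that flashes in for one frame has no low-distance match in either previous frame and is dropped, while a box the detector misses in the current frame is reinstated from the future frame, matching the two failure modes described at the start of the chapter.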

Figure 3-1: Diagram showing the forward-backward correlation scheme; the red lines indicate correlations between the current frame and the previous two frames, and the blue lines indicate correlations between the future frame and the previous two frames.

The boxes for each frame cannot be accessed until the frame has been processed by the detector, and each frame is processed sequentially. In order to access the future frame, the system delays the results by one frame: the results from the current frame are the future boxes, while the results from the previous frame are the current results. The results of running this algorithm are shown in Table 3.1. Forward-backward correlation gives a higher recall while increasing precision in comparison to detector-only results.

3.1.8 Results

From the results of the simulated algorithms, a forward-backward algorithm using Equation 3.6 to generate a correlation score over multiple frames was chosen for hardware implementation. Correlating over multiple frames gave the greatest increases in precision of all the algorithms, and the forward-backward scheme prevents a significant decrease in recall.

Table 3.1: Filtering Test Results

Algorithm                  Detected  Benchmark  Missed  Precision  Recall
Detector Only
Single Frame, Size only
Single Frame, with Score
Single Frame, Size, FB
Single Frame, Score, FB
Two Frames, Size only
Two Frames, with Score
Two Frames, Size, FB
Two Frames, Score, FB

A table detailing the performance of each algorithm on the KITTI dataset; FB stands for forward-backward. The first three columns are the totals of boxes detected, boxes in the benchmark, and boxes missed by the filter over a video. The Precision and Recall columns are calculated by the methods described above.

A visual comparison of the detection results and the filtered detection results in a single frame is shown in Figure 3-2. In the second row, a false positive caused by noise has been removed while correct boxes remain. The interpolation of dropped boxes is illustrated in Figure 3-3. In the second row, a box dropped by the detection has been interpolated by the filter.
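The precision and recall columns of Table 3.1 can be computed as in the following sketch. The 50% overlap criterion is read here as coverage of the benchmark box, which is one possible interpretation; the function names are hypothetical.

```python
def box_overlap_ratio(det, gt):
    """Fraction of the ground-truth box covered by the detection.
    Boxes are (x0, y0, x1, y1) tuples."""
    iw = min(det[2], gt[2]) - max(det[0], gt[0])
    ih = min(det[3], gt[3]) - max(det[1], gt[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # no overlap at all
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return iw * ih / gt_area

def precision_recall(detections, benchmark):
    """Precision and recall with the 50% overlap correctness criterion."""
    correct = sum(1 for d in detections
                  if any(box_overlap_ratio(d, g) >= 0.5 for g in benchmark))
    precision = correct / len(detections) if detections else 0.0
    recall = correct / len(benchmark) if benchmark else 0.0
    return precision, recall
```

A stricter variant would match each benchmark box at most once; the simple any-match form above suffices to illustrate the two metrics.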

Figure 3-2: Comparison of results highlighting false positive removal. Each image in a row shows the same frame. The first column is the detection results, the second the filtered results, and the third the ground truth.

Figure 3-3: Comparison of results highlighting an interpolated frame. Each image in a row shows the same frame. The first column is the detection results, the second the filtered results, and the third the ground truth.

Figure 3-4 shows the Average Precision (AP) curves of using the filtering algorithm versus using the detector only (blue and red curves, respectively). These curves are generated by sweeping the detection threshold from a high to a low level. The correlation threshold must be adjusted for each KITTI video. As shown in the plot, for lower thresholds the filter has a higher precision than the detector alone, as false positives are being removed from the detection results. At high thresholds, the filtering sees worse performance, as sparse detection results cannot be correlated and the detector does not produce noisy results. The optimal threshold for high recall values occurs at lower thresholds; thus, the filter is effective over the optimal threshold values. Above these thresholds, it is assumed filtering is unnecessary. Considering only the values with recall > 0.5, the filtering increases the average precision by 6.23% on a single video of the KITTI data set.

Figure 3-4: Diagram showing the Average Precision curve of both the detected and filtered results for a single KITTI road video. For higher recall values, the filter demonstrates higher precision values. Note the line at recall = 0.5.
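A sketch of how an average-precision number could be reduced from such a sweep is shown below. Trapezoidal integration over recall is one common convention; the thesis does not state its exact AP formula, so this is an assumption for illustration.

```python
def average_precision(points):
    """Approximate AP from (recall, precision) sweep points by
    trapezoidal integration over the recall axis."""
    pts = sorted(points)  # order by increasing recall
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2
    return ap
```

Restricting `points` to pairs with recall above 0.5 reproduces the kind of region-limited comparison used for the 6.23% figure above.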

3.2 RTL Implementation

The filter architecture uses a pipelined approach in which four boxes are processed in each cycle. The new boxes are loaded from the final display boxes produced by the NMS sub-module, and boxes are selected from each of the frames as shown in Figure 3-1. The filter is divided into three pipelined phases: box loading and selection; correlation calculation and threshold comparison; and serialization and modified NMS. Pipelining occurs both between each of these blocks and within the blocks themselves to meet the timing requirements of the system. A block diagram of the overall filter design is shown in Figure 3-5.

Figure 3-5: High-level block diagram showing the major filtering sub-modules. Pipelining occurs between each block.

3.2.1 Box Loading and Selection

After the detector module fully processes a frame, the results of the previous frame's NMS are available to the filtering algorithm. Using a shift register architecture, the new boxes are moved into the first location and all other entries are pushed back by one, with the final entry being discarded. A shift register is used over a FIFO or BRAM structure for the ability to access multiple elements in the same cycle. The values in this shift register are then stored in register array structures for more direct access to the individual boxes. The loading phase has a latency of two cycles: one for the shift register update and one for the array register update.
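The shift-register frame history can be modeled as follows. This is a behavioral sketch: `deque` stands in for the hardware shift register, the class name is hypothetical, and the depth of four corresponds to one future frame, the current frame, and the two previous frames.

```python
from collections import deque

class FrameHistory:
    """Shift-register-style store of per-frame NMS results: the newest
    frame enters at the front and the oldest is discarded."""
    def __init__(self, depth=4):
        self.frames = deque(maxlen=depth)

    def push(self, boxes):
        """Load a new frame's boxes; all older frames shift back by one."""
        self.frames.appendleft(list(boxes))

    def get(self, age):
        """age 0 = newest stored frame (the 'future' frame once the
        one-frame display delay is accounted for)."""
        return list(self.frames[age])
```

Unlike a FIFO or BRAM, every stored frame remains addressable at once, which is what lets the selection logic read one box from each of the four frames in the same cycle.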

Figure 3-6: Block diagram showing the box loading and selection process. The current box is used as an example.

To select the set of boxes from each frame, two internal indices are incremented, representing the box to select from each frame: index_1 is the box index for the current and future frames, and index_2 is the box index for the two previous frames. Accessing the two previous frames with the same index each cycle, and the current and future frames with the same index each cycle, creates the highest level of parallelization and requires the minimum number of cycles to assess all combinations of boxes; there are (number of boxes)² combinations possible between two frames. Figure 3-6 demonstrates the complete process.

3.2.2 Correlation Calculation and Threshold Comparison

The block diagram of the correlation calculation and thresholding pipeline is shown in Figure 3-7. For each combination of index_1 and index_2, four correlations are computed, as shown in Figure 3-1. These correlations are computed in parallel using Equation 3.6, and the computation is pushed to DSP blocks during compilation. The

output of this correlation is a 45-bit integer; the bit width is increased by the multiplicative normalization factors. The correlations between the two frames are combined with a weight of 0.75 on the first previous frame and 0.25 on the second previous frame. These computations also require 45-bit integers. The final result is compared to the threshold defined in Section. The boolean result of the comparison is stored in a register: if the boolean is true, the current box is stored in a register; otherwise, zero is stored for the box data. The latency of these blocks is three cycles, as shown in the block diagram in Figure 3-7.

Figure 3-7: Block diagram showing the pipelined correlation calculation and threshold comparison. The current box is used as an example.

Serialization and Modified NMS

The four correlation results are serialized, stored in a FIFO of size 1024, and compared using a modified Non-Maximal Suppression module. This pipeline is shown in Figure 3-8. The FIFO is generated using an IP core provided by Xilinx, whose minimum size of 1024 entries is oversized for the current filtering application, but utilizing the on-FPGA internal FIFOs is a better resource-utilization strategy. The modified Non-Maximal Suppression algorithm still takes the new box and the previously stored
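The weighted combination and threshold compare can be sketched as follows. The 0.75/0.25 weights come from the text; the concrete operand values and the shift-and-add remark are illustrative assumptions, standing in for the 45-bit integer datapath:

```python
def combine_and_threshold(corr_prev1, corr_prev2, threshold):
    """Sketch of the weighted combination and threshold compare.
    The weights 0.75 and 0.25 are powers-of-two fractions, so in
    hardware they reduce to shifts and adds; plain arithmetic is
    used here in place of the 45-bit integer datapath."""
    combined = 0.75 * corr_prev1 + 0.25 * corr_prev2
    passed = combined > threshold
    # On a pass the current box is kept; otherwise zero is stored.
    return combined, passed

combined, passed = combine_and_threshold(1000, 200, 700)
# combined = 0.75*1000 + 0.25*200 = 800, which exceeds the threshold
```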

final boxes as input. It is modified to apply only an overlap threshold: if the boxes do not overlap, the new box is added to the final boxes.

Figure 3-8: Block diagram showing the pipelined serialization and modified NMS.

The maximum latency of this module is set by the maximum number of boxes stored in the FIFO, which is (number of future frames + 1) × (number of boxes)², plus one cycle to store the results of NMS2. In practice, since boxes are not written to the FIFO when neither the future nor the current box passes the threshold, this latency can be much lower.

Parametrization

While the empirically determined parameters give the filtering algorithm adequate performance, future iterations of this sub-module can be further optimized. With future work in mind, the filter module has been designed to allow a high level of pre-compile parametrization. The parameters controlled are: the number of boxes per frame, the number of future frames, the number of previous frames, and the correlation thresholds. In the current implementation, nb = 30, ff = 1, and bf = 2, where nb is the number of boxes per frame, ff is the number of future frames, and bf is the number of back frames considered.

Overall Performance

The filtering begins when the NMS calculation completes, as discussed in Section 1.4. As such, the filtered boxes are not available to be shown at the beginning of each frame. Summing the latency of each sub-module gives the following equation:
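The worst-case occupancy bound above is simple to evaluate; a one-line sketch with the current parameters (`max_nms2_latency` is a hypothetical name):

```python
def max_nms2_latency(nb, ff):
    """Worst-case cycles for the serialization/NMS2 stage: up to
    (ff + 1) * nb**2 candidate entries sit in the FIFO, plus one
    cycle to store the NMS2 result."""
    return (ff + 1) * nb * nb + 1

# nb = 30 boxes per frame, ff = 1 future frame:
worst_case = max_nms2_latency(30, 1)   # 2 * 900 + 1 = 1801 cycles
```

In practice most candidates fail the threshold and never enter the FIFO, so the observed latency is far below this bound.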

latency = T(load boxes) + T(correlation, comparison) + T(serialization, NMS2) ≈ (ff + 1) · nb² = O(nb² · ff)

where nb is the number of boxes per frame and ff is the number of future frames considered. In the current implementation, nb = 30 and ff = 1, so latency = 1800 cycles. The filtering module uses the detector clock, so the latency is 18 μs. As each frame is displayed for 33.3 milliseconds, this latency is acceptable. The latency grows as O(nb²) with the number of boxes in each frame and as O(ff) with the number of future frames. Additionally, as mentioned above, each future frame causes the detected results to be displayed one frame later than the image.

The total utilization of the filter sub-module is 4.7% of LUTs, 2.25% of Registers, 3 36Kb BRAM tiles (0.29%), and 73 DSPs (2.61%). The high Register and LUT usage results from the storage and access of box data; the large bit widths used in the correlation and comparison calculations also increase register usage. This sub-module prioritizes high throughput and detection performance over resource utilization: for a 34% increase in LUTs, a 49% increase in Registers, a 1.2% increase in BRAM, and a 35% increase in DSPs, the system increases detection accuracy by 6.23%.
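The stated figures can be sanity-checked with a short calculation. The detector clock frequency is not given explicitly in this section; the 100 MHz value below is inferred from the stated 1800 cycles and 18 μs:

```python
nb, ff = 30, 1                       # boxes per frame, future frames
latency_cycles = (ff + 1) * nb**2    # dominant term of the latency equation
assert latency_cycles == 1800        # matches the stated cycle count

latency_s = 18e-6                    # stated filter latency
clock_hz = latency_cycles / latency_s    # implied detector clock: 100 MHz

frame_time_s = 1 / 30.0              # 30 fps display, ~33.3 ms per frame
# The 18 us filter latency is a tiny fraction of a frame time,
# which is why it is acceptable for real-time display.
```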


Chapter 4

Pre-Recorded Data Load

In addition to the live video camera input, the testing system can either be compiled with ROMs storing static frames or transfer pre-recorded data into the system via the SD Card. These interfaces are necessary to better test and benchmark the different detection algorithms being evaluated. The ROMs replace the BRAM memory described in Section 1.4. The SD Card interacts with either the BRAM memory or the DRAM memory described in Chapter 2. By routing all data entering and leaving the system through the memory systems, the system becomes robust to the differing transfer rates and latencies associated with different modules.

4.1 On-FPGA ROM Static Image Storage

In order to test the first iteration of the system with pre-recorded data, ROMs pre-loaded with grayscale static images replaced the BRAM and were compiled with the design. Testing the detector chips on static images provided an initial verification of algorithm correctness. The downside of using pre-loaded ROMs is that the images must be loaded pre-compile and cannot be changed post-compile; multiple compilations are required to test multiple images. Color images cannot be stored, as the FPGA resources are limited.

4.2 SD Card

In order to load pre-recorded data onto the FPGA post-compile, a method of transferring data into the system was implemented. Currently, the SD Card is used for this transfer. The SD Card loads either single frames, which can be cycled through sequentially, or full videos. The single-frame incremental load interfaces with the BRAM and uses grayscale images; the video load interfaces with the DRAM and uses full-color data.

SD Card Control

The SD Card interface is a modified version of the interface developed by Luis Fernádez in [14]. An open-source VHDL control module is used for SPI control of the SD Card [31]. This module operates at a maximum clock frequency of 12.5 MHz and has a transfer rate of 1.5 MBPS. The signals that control this module are generated by a state machine, which performs a handshaking procedure to ensure correct data transfer between the SD Card and the FPGA; the details are covered in [14].

The state machine has three states: IDLE, READEIGHTBYTES, and MEMREADY. The state machine is initially IDLE, and the SD Card read starts on a user input. The state machine then changes to READEIGHTBYTES and concatenates one byte of output from the SD control module with the previously received bytes. After receiving 8 bytes, the state machine enters the MEMREADY state, where it waits for an acknowledgment from the upper-level module. After the acknowledgment is received, the state machine re-enters the READEIGHTBYTES state. The number of bytes to read from the SD Card is calculated by multiplying the number of bytes in a frame by the number of frames stored on the SD Card; both of these values are parameters that can be adjusted pre-compile. When the maximum number of bytes has been read, the state machine enters the IDLE state. Figure 4-1 shows a visual flow of this state machine.
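The state machine above can be modeled behaviorally. In this Python sketch the byte source and the upper-level acknowledgment are stubbed (the ack is assumed immediate), the user start input is assumed already given, and `run_sd_reader` is a hypothetical name:

```python
# Behavioral sketch of the SD Card read state machine described above.
IDLE, READEIGHTBYTES, MEMREADY = "IDLE", "READEIGHTBYTES", "MEMREADY"

def run_sd_reader(byte_stream, total_bytes):
    """Collect bytes 8 at a time, passing through MEMREADY on each
    acknowledgment, until total_bytes have been read."""
    state, words, current = READEIGHTBYTES, [], []
    for b in byte_stream:
        if state == READEIGHTBYTES:
            current.append(b)              # concatenate one byte per cycle
            if len(current) == 8:
                state = MEMREADY
        if state == MEMREADY:
            words.append(bytes(current))   # upper-level module acks
            current = []
            read = len(words) * 8
            state = IDLE if read >= total_bytes else READEIGHTBYTES
        if state == IDLE:
            break
    return state, words

state, words = run_sd_reader(iter(range(16)), total_bytes=16)
# Two 8-byte words are produced, then the machine returns to IDLE.
```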

Figure 4-1: SD Card controller state machine for video reading.

Six of these 64-bit outputs are concatenated into a 384-bit-wide word, which is zero-extended to 512 bits. This word is the sixteen-pixel, 24-bit full-color representation that can be used by other modules of the system. Each word is assigned an address and stored in a FIFO; this FIFO pipelines the transfer between the external interface and the FPGA system.

Single Frame Load

The SD Card can be used to transfer multiple static images, which can be sequentially accessed through a button push. Loading image data onto the SD Card is a much faster testing process than recompiling the design, and multiple static images can be tested in the same test. To load single frames with a button push, a fourth state, PAUSE, is added to the state machine. After the number of bytes in a frame has been read, the state machine enters the PAUSE state until the user presses the button; then the state machine enters the READEIGHTBYTES state. The button press also resets the address counter to ensure the first data of the new frame is written to the first address. The updated state machine is shown in Figure 4-2. Currently, this reading functionality
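The word-assembly step can be sketched with plain integer arithmetic. This assumes six 64-bit reader outputs per pixel word (6 × 64 = 384 bits = sixteen 24-bit pixels); `pack_pixel_word` is a hypothetical name:

```python
def pack_pixel_word(outputs):
    """Sketch of the word-assembly step: six 64-bit reader outputs
    (384 bits, i.e. sixteen 24-bit RGB pixels) are concatenated and
    left zero-extended to form a 512-bit memory word."""
    assert len(outputs) == 6
    word = 0
    for out in outputs:                  # concatenate 64-bit chunks
        word = (word << 64) | (out & (2**64 - 1))
    return word                          # upper 128 bits remain zero

word = pack_pixel_word([0xFFFFFFFFFFFFFFFF] * 6)
# 384 one-bits occupy the low half of the 512-bit word
```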

is integrated with the BRAM on-chip memory.

Figure 4-2: SD Card controller state machine with PAUSE state included.

Video Load

To load a full video, the initial SD state machine remains the same. The address continually increments until it reaches the maximum frame address, at which point it resets. The SD Card is read until the full video is loaded; the number of frames of video data can be updated pre-compile. 1 GB of DRAM can hold 166 6-MB frames, but in the current system a maximum of 131 frames can be loaded into DRAM due to address constraints. The 1.5 MBPS transfer rate from the SD Card results in a frame transfer rate of 0.25 FPS. As this frame rate would severely bottleneck the system, the entire video is loaded into DRAM off-line and then read by the detector and display modules. This process requires a different DRAM arbitration scheme; all interfaces to the memory module remain the same.
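The capacity and transfer-rate figures above follow from a short calculation (decimal MB/GB units assumed, matching the 166-frame figure):

```python
# Sanity check of the DRAM capacity and SD transfer figures quoted above.
dram_bytes = 1 * 1000**3            # 1 GB of DRAM
frame_bytes = 6 * 1000**2           # ~6 MB per full-color frame

frames_in_dram = dram_bytes // frame_bytes      # 166 frames fit in DRAM
sd_rate = 1.5e6                                 # 1.5 MBPS SPI transfer rate
frames_per_second = sd_rate / frame_bytes       # 0.25 FPS load rate
seconds_per_frame = 1 / frames_per_second       # 4 s to load one frame
```

At 0.25 FPS the SD Card cannot feed a 30 FPS pipeline directly, which is why the video is staged into DRAM off-line first.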


More information

High Performance Carry Chains for FPGAs

High Performance Carry Chains for FPGAs High Performance Carry Chains for FPGAs Matthew M. Hosler Department of Electrical and Computer Engineering Northwestern University Abstract Carry chains are an important consideration for most computations,

More information

IT T35 Digital system desigm y - ii /s - iii

IT T35 Digital system desigm y - ii /s - iii UNIT - III Sequential Logic I Sequential circuits: latches flip flops analysis of clocked sequential circuits state reduction and assignments Registers and Counters: Registers shift registers ripple counters

More information

CHAPTER1: Digital Logic Circuits

CHAPTER1: Digital Logic Circuits CS224: Computer Organization S.KHABET CHAPTER1: Digital Logic Circuits 1 Sequential Circuits Introduction Composed of a combinational circuit to which the memory elements are connected to form a feedback

More information

Scan. This is a sample of the first 15 pages of the Scan chapter.

Scan. This is a sample of the first 15 pages of the Scan chapter. Scan This is a sample of the first 15 pages of the Scan chapter. Note: The book is NOT Pinted in color. Objectives: This section provides: An overview of Scan An introduction to Test Sequences and Test

More information

ECE 4220 Real Time Embedded Systems Final Project Spectrum Analyzer

ECE 4220 Real Time Embedded Systems Final Project Spectrum Analyzer ECE 4220 Real Time Embedded Systems Final Project Spectrum Analyzer by: Matt Mazzola 12222670 Abstract The design of a spectrum analyzer on an embedded device is presented. The device achieves minimum

More information

FPGA Design. Part I - Hardware Components. Thomas Lenzi

FPGA Design. Part I - Hardware Components. Thomas Lenzi FPGA Design Part I - Hardware Components Thomas Lenzi Approach We believe that having knowledge of the hardware components that compose an FPGA allow for better firmware design. Being able to visualise

More information

Traffic Light Controller

Traffic Light Controller Traffic Light Controller Four Way Intersection Traffic Light System Fall-2017 James Todd, Thierno Barry, Andrew Tamer, Gurashish Grewal Electrical and Computer Engineering Department School of Engineering

More information

Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath

Objectives. Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath Objectives Combinational logics Sequential logics Finite state machine Arithmetic circuits Datapath In the previous chapters we have studied how to develop a specification from a given application, and

More information

EE178 Lecture Module 4. Eric Crabill SJSU / Xilinx Fall 2005

EE178 Lecture Module 4. Eric Crabill SJSU / Xilinx Fall 2005 EE178 Lecture Module 4 Eric Crabill SJSU / Xilinx Fall 2005 Lecture #9 Agenda Considerations for synchronizing signals. Clocks. Resets. Considerations for asynchronous inputs. Methods for crossing clock

More information

Static Timing Analysis for Nanometer Designs

Static Timing Analysis for Nanometer Designs J. Bhasker Rakesh Chadha Static Timing Analysis for Nanometer Designs A Practical Approach 4y Spri ringer Contents Preface xv CHAPTER 1: Introduction / 1.1 Nanometer Designs 1 1.2 What is Static Timing

More information

THE USE OF forward error correction (FEC) in optical networks

THE USE OF forward error correction (FEC) in optical networks IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 52, NO. 8, AUGUST 2005 461 A High-Speed Low-Complexity Reed Solomon Decoder for Optical Communications Hanho Lee, Member, IEEE Abstract

More information

CHAPTER 6 DESIGN OF HIGH SPEED COUNTER USING PIPELINING

CHAPTER 6 DESIGN OF HIGH SPEED COUNTER USING PIPELINING 149 CHAPTER 6 DESIGN OF HIGH SPEED COUNTER USING PIPELINING 6.1 INTRODUCTION Counters act as important building blocks of fast arithmetic circuits used for frequency division, shifting operation, digital

More information

VGA Controller. Leif Andersen, Daniel Blakemore, Jon Parker University of Utah December 19, VGA Controller Components

VGA Controller. Leif Andersen, Daniel Blakemore, Jon Parker University of Utah December 19, VGA Controller Components VGA Controller Leif Andersen, Daniel Blakemore, Jon Parker University of Utah December 19, 2012 Fig. 1. VGA Controller Components 1 VGA Controller Leif Andersen, Daniel Blakemore, Jon Parker University

More information

EITF35: Introduction to Structured VLSI Design

EITF35: Introduction to Structured VLSI Design EITF35: Introduction to Structured VLSI Design Part 4.2.1: Learn More Liang Liu liang.liu@eit.lth.se 1 Outline Crossing clock domain Reset, synchronous or asynchronous? 2 Why two DFFs? 3 Crossing clock

More information

Interlace and De-interlace Application on Video

Interlace and De-interlace Application on Video Interlace and De-interlace Application on Video Liliana, Justinus Andjarwirawan, Gilberto Erwanto Informatics Department, Faculty of Industrial Technology, Petra Christian University Surabaya, Indonesia

More information

VLSI Design: 3) Explain the various MOSFET Capacitances & their significance. 4) Draw a CMOS Inverter. Explain its transfer characteristics

VLSI Design: 3) Explain the various MOSFET Capacitances & their significance. 4) Draw a CMOS Inverter. Explain its transfer characteristics 1) Explain why & how a MOSFET works VLSI Design: 2) Draw Vds-Ids curve for a MOSFET. Now, show how this curve changes (a) with increasing Vgs (b) with increasing transistor width (c) considering Channel

More information

microenable 5 marathon ACL Product Profile of microenable 5 marathon ACL Datasheet microenable 5 marathon ACL

microenable 5 marathon ACL Product Profile of microenable 5 marathon ACL   Datasheet microenable 5 marathon ACL i Product Profile of Scalable, intelligent high performance frame grabber for highest requirements on image acquisition and preprocessing by robust industrial MV standards All formats of Camera Link standard

More information

Lossless Compression Algorithms for Direct- Write Lithography Systems

Lossless Compression Algorithms for Direct- Write Lithography Systems Lossless Compression Algorithms for Direct- Write Lithography Systems Hsin-I Liu Video and Image Processing Lab Department of Electrical Engineering and Computer Science University of California at Berkeley

More information

FPGA Prototyping using Behavioral Synthesis for Improving Video Processing Algorithm and FHD TV SoC Design Masaru Takahashi

FPGA Prototyping using Behavioral Synthesis for Improving Video Processing Algorithm and FHD TV SoC Design Masaru Takahashi FPGA Prototyping using Behavioral Synthesis for Improving Video Processing Algorithm and FHD TV SoC Design Masaru Takahashi SoC Software Platform Division, Renesas Electronics Corporation January 28, 2011

More information

ISSCC 2006 / SESSION 18 / CLOCK AND DATA RECOVERY / 18.6

ISSCC 2006 / SESSION 18 / CLOCK AND DATA RECOVERY / 18.6 18.6 Data Recovery and Retiming for the Fully Buffered DIMM 4.8Gb/s Serial Links Hamid Partovi 1, Wolfgang Walthes 2, Luca Ravezzi 1, Paul Lindt 2, Sivaraman Chokkalingam 1, Karthik Gopalakrishnan 1, Andreas

More information

TABLE 3. MIB COUNTER INPUT Register (Write Only) TABLE 4. MIB STATUS Register (Read Only)

TABLE 3. MIB COUNTER INPUT Register (Write Only) TABLE 4. MIB STATUS Register (Read Only) TABLE 3. MIB COUNTER INPUT Register (Write Only) at relative address: 1,000,404 (Hex) Bits Name Description 0-15 IRC[15..0] Alternative for MultiKron Resource Counters external input if no actual external

More information

ESE (ESE534): Computer Organization. Last Time. Today. Last Time. Align Data / Balance Paths. Retiming in the Large

ESE (ESE534): Computer Organization. Last Time. Today. Last Time. Align Data / Balance Paths. Retiming in the Large ESE680-002 (ESE534): Computer Organization Day 20: March 28, 2007 Retiming 2: Structures and Balance Last Time Saw how to formulate and automate retiming: start with network calculate minimum achievable

More information

MULTIMEDIA TECHNOLOGIES

MULTIMEDIA TECHNOLOGIES MULTIMEDIA TECHNOLOGIES LECTURE 08 VIDEO IMRAN IHSAN ASSISTANT PROFESSOR VIDEO Video streams are made up of a series of still images (frames) played one after another at high speed This fools the eye into

More information

AN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER

AN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER University of Kentucky UKnowledge Theses and Dissertations--Electrical and Computer Engineering Electrical and Computer Engineering 2007 AN EFFECTIVE CACHE FOR THE ANYWHERE PIXEL ROUTER Vijai Raghunathan

More information

Memory Interfaces Data Capture Using Direct Clocking Technique Author: Maria George

Memory Interfaces Data Capture Using Direct Clocking Technique Author: Maria George Application Note: Virtex-4 Family R XAPP701 (v1.4) October 2, 2006 Memory Interfaces Data Capture Using Direct Clocking Technique Author: Maria George Summary This application note describes the direct-clocking

More information

Understanding Design Requirements for Building Reliable, Space-Based FPGA MGT Systems Based on Radiation Test Results

Understanding Design Requirements for Building Reliable, Space-Based FPGA MGT Systems Based on Radiation Test Results Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2012-03-20 Understanding Design Requirements for Building Reliable, Space-Based FPGA MGT Systems Based on Radiation Test Results

More information

SPATIAL LIGHT MODULATORS

SPATIAL LIGHT MODULATORS SPATIAL LIGHT MODULATORS Reflective XY Series Phase and Amplitude 512x512 A spatial light modulator (SLM) is an electrically programmable device that modulates light according to a fixed spatial (pixel)

More information

CMS Conference Report

CMS Conference Report Available on CMS information server CMS CR 1997/017 CMS Conference Report 22 October 1997 Updated in 30 March 1998 Trigger synchronisation circuits in CMS J. Varela * 1, L. Berger 2, R. Nóbrega 3, A. Pierce

More information

Laboratory 1 - Introduction to Digital Electronics and Lab Equipment (Logic Analyzers, Digital Oscilloscope, and FPGA-based Labkit)

Laboratory 1 - Introduction to Digital Electronics and Lab Equipment (Logic Analyzers, Digital Oscilloscope, and FPGA-based Labkit) Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6. - Introductory Digital Systems Laboratory (Spring 006) Laboratory - Introduction to Digital Electronics

More information

More Digital Circuits

More Digital Circuits More Digital Circuits 1 Signals and Waveforms: Showing Time & Grouping 2 Signals and Waveforms: Circuit Delay 2 3 4 5 3 10 0 1 5 13 4 6 3 Sample Debugging Waveform 4 Type of Circuits Synchronous Digital

More information

LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter

LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter LUT Optimization for Distributed Arithmetic-Based Block Least Mean Square Adaptive Filter Abstract: In this paper, we analyze the contents of lookup tables (LUTs) of distributed arithmetic (DA)- based

More information

Laboratory Exercise 7

Laboratory Exercise 7 Laboratory Exercise 7 Finite State Machines This is an exercise in using finite state machines. Part I We wish to implement a finite state machine (FSM) that recognizes two specific sequences of applied

More information

Timing Error Detection: An Adaptive Scheme To Combat Variability EE241 Final Report Nathan Narevsky and Richard Ott {nnarevsky,

Timing Error Detection: An Adaptive Scheme To Combat Variability EE241 Final Report Nathan Narevsky and Richard Ott {nnarevsky, Timing Error Detection: An Adaptive Scheme To Combat Variability EE241 Final Report Nathan Narevsky and Richard Ott {nnarevsky, tomott}@berkeley.edu Abstract With the reduction of feature sizes, more sources

More information