An FPGA Platform for Demonstrating Embedded Vision Systems. Ariana Eisenstein


An FPGA Platform for Demonstrating Embedded Vision Systems

by Ariana Eisenstein

B.S., Massachusetts Institute of Technology (2015)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Electrical Engineering at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

June 2016

© Massachusetts Institute of Technology 2016. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 20, 2016

Certified by: Vivienne Sze, Emanuel E. Landsman (1958) Career Development Professor, Thesis Supervisor

Accepted by: Dr. Christopher J. Terman, Chairman, Department Committee on Graduate Theses


An FPGA Platform for Demonstrating Embedded Vision Systems

by Ariana Eisenstein

Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2016, in partial fulfillment of the requirements for the degree of Master of Engineering in Computer Science and Electrical Engineering

Abstract

This thesis presents an FPGA platform that can be used to enable real-time embedded vision systems, specifically object detection. Interfaces are built between the FPGA and a high definition (1920 x 1080) HDMI camera, an off-chip DRAM, an FMC connector, an SD Card, and an HDMI display. The interface processing includes debayering for the camera input, arbitration for the DRAM, and object annotation for the display. The platform must also handle the different clock frequencies of the various interfaces. Real-time object detection at 30 frames per second is demonstrated by either connecting the platform to an object detection ASIC via the FMC connector, or directly implementing the object detection RTL on the FPGA. Using this platform, ASICs developed in the Energy-Efficient Multimedia Systems lab can be verified and benchmarked on both live video via the HDMI camera as well as pre-recorded media via an SD Card. Finally, a post-processing filter has been implemented on the FPGA to reduce false positives and interpolate missed object detections by leveraging temporal correlations.

Thesis Supervisor: Vivienne Sze
Title: Emanuel E. Landsman (1958) Career Development Professor


Acknowledgments

I would like to thank my supervisor Professor Vivienne Sze for allowing me the opportunity to work with the Energy Efficient Multimedia Systems Group and develop this thesis. Thank you for the support and knowledge, which greatly enhanced my problem solving and design understanding. Thank you to Amr Suleiman for being a continuous source of advice and guidance on this project. Without you this thesis would not be possible; good luck in your future projects. Thank you to my friends at MIT for being there for me through the ups and the downs. I would not have made it without you. Finally, thanks to my family for their continued support in all my goals.


Contents

1 Introduction
  1.1 Motivations for Hardware Accelerated Computer Vision Algorithms
  1.2 Discussions of Computer Vision Algorithms Implemented in System
  1.3 Overall System Design
  1.4 Previous Work - SuperUROP 2015
    1.4.1 Camera Preprocessing
    1.4.2 Memory Interface and Management
    1.4.3 Detector Display
2 DRAM Architecture and Arbitration
  2.1 Arbitration
  2.2 Architecture
    2.2.1 Memory Interface Module States Control Signals
    2.2.2 Response Labeling
    2.2.3 Write Phase and Input
    2.2.4 Display Read Phase and Output
    2.2.5 Detector Read Phase and Output
    2.2.6 FIFO Read and Memory Controller Communication
    2.2.7 Addressing Time Multiplexed Data
3 Temporal Filtering
  3.1 Algorithm Exploration and Selection
    3.1.1 Previous Work
    3.1.2 Base Algorithm and Exploration
    3.1.3 Performance Measurement
    3.1.4 Position Correlation
    3.1.5 Detection Score Thresholding
    3.1.6 Multiple Frame Correlation
    3.1.7 Forward Backward Correlation
    3.1.8 Results
  3.2 RTL Implementation
    3.2.1 Box Loading and Selection
    3.2.2 Correlation Calculation and Threshold Comparison
    3.2.3 Serialization and Modified NMS
    3.2.4 Parametrization
    3.2.5 Overall Performance
4 Pre-Recorded Data Load
  4.1 On-FPGA ROM Static Image Storage
  4.2 SD Card
    4.2.1 SD Card Control
    4.2.2 Single Frame Load
    4.2.3 Video Load
5 Conclusion
  5.1 Contribution
  5.2 Future Work
    5.2.1 Tracking
    5.2.2 PC Connection
  5.3 Summary
A Conversions and Control Signals
  A.1 User Interface Controller Signals
  A.2 Detector Output to Pixel Conversions

List of Figures

1-1 System Block Diagram, showing the movement of data from the camera input to the display output. An additional external input of SD Card data is shown. Boundaries are drawn to show the separate clocking domains within the FPGA system.
1-2 Camera Pipeline Architecture, showing the data flow from the camera input to the memory interface, as well as the location of the BRAM memory.
1-3 a: Bayer Filter Pattern, b: Possible Three-by-Three Matrices; each of the matrices can be mapped to a three-by-three location on the Bayer pattern.
1-4 a: Down-sampled Grayscale Image, b: Grayscale image after up-sampling; after interpolating the camera data, the effects visible in (a) are no longer present.
1-5 Organization of data in frame from Vita 2000 image sensor [21].
1-6 Human Understandable Detector Results.
1-7 Detector Display module; Note the serialization of multiple detector results.
1-8 a: All Visible Boxes, b: Consolidated Boxes; after using the Non-Maximal Suppression modules the overlapping boxes are consolidated into a single box around each person.
1-9 Ping Pong Architecture, demonstrating the architecture in which data is continuously flowing into one box calculation to be used, while the other calculation is fed to the display.
1-10 Final pixel drawing in Detector Display module; Note the clock domains of each data path. '|' represents bitwise or; the boolean will be False if the box pixel has not been assigned a value.
2-1 Memory Interface module Block Diagram showing the connection between the three modules corresponding to each external interface and the off-FPGA DRAM.
2-2 Memory Interface Module Block Diagram showing three external interfaces as well as the connection to DRAM. The camera clock is 63 MHz, the display clock is 150 MHz, the memory clock is 200 MHz, and the DDR3 clock is 800 MHz. The detector clock depends on the detector being tested.
2-3 State machine showing the movement between phases of the arbitration.
2-4 UI memory controller and interface with external DRAM [32].
3-1 Diagram showing the forward backward correlation scheme; the red lines indicate correlation between the current frame and previous two frames, the blue lines indicate correlations between the future frame and previous two frames.
3-2 Comparison of results, highlighting false positive removal. Each image in a row shows the same frame. The first column is detection results, the second filtered results, and the third ground truth.
3-3 Comparison of results highlighting interpolated frame. Each image in a row shows the same frame. The first column is detection results, the second filtered results, and the third ground truth.
3-4 Diagram showing the Average Precision curve of both the detected and filtered results for a single KITTI road video. For higher accuracy (recall) values, the filter demonstrates higher precision values. Note the line at recall = 0.5.
3-5 High level Block diagram showing the major filtering sub-modules. Pipelining occurs between each block.
3-6 Block diagram showing the pipelined correlation calculation and threshold comparison. Current box is used as an example.
3-7 Block diagram showing the pipelined correlation calculation and threshold comparison. Current box is used as an example.
3-8 Block diagram showing the pipelined correlation calculation and threshold comparison.
4-1 SD Card controller state machine for video reading.
4-2 SD Card controller state machine with PAUSE state included.
4-3 Arbitration state machine showing movement between phases of DRAM arbitration.


List of Tables

1.1 System Clocks
1.2 Module Utilization
2.1 Module to Memory Location
3.1 Filtering Test Results


Chapter 1

Introduction

Many systems such as surveillance, advanced driver assistance systems (ADAS), portable electronics, and robotics need to process visual data collected from the world around them quickly and correctly. As these systems become smaller and more mobile, the processing will need to consume less power in order to achieve a reasonable battery life. Accelerating computer vision algorithms in dedicated hardware (i.e. ASICs [Application Specific Integrated Circuits]) allows faster processing of video data with higher resolutions and frame rates, while using less power and memory resources than traditional software based implementations.

The Energy Efficient Multimedia Systems (EEMS) Group at MIT has developed several ASICs which implement low power object detection algorithms. As described in [26, 27], these detectors operate on high definition 1920 by 1080 video data at a rate of 30-60 fps [frames per second]. However, in order for the detectors to meet these metrics, they must receive HD video input at 30-60 fps. Similarly, these detectors produce results at clock frequencies between 62.5 and 125 MHz. In order to demonstrate the chip's correctness, these results must be collected and converted into a human understandable form. Thus, the challenge is creating a suitable system which can both provide data input to and collect output from the chip consistently, neither dropping data nor adding latency to the system.

This thesis presents a platform developed on the VC707 Development Board with a Virtex-7 FPGA for testing and demonstrating complete systems using object detection ASICs.

This platform has two input configurations (Camera and SD Card). The camera provides live video input to demonstrate the detector's robustness on real-world content. The SD card transfers pre-recorded media, both videos and fixed frames, to the detector. These can be used to test the ASICs on test images as well as benchmark the implemented algorithms on standard datasets.

The video frame input to the system is stored off-FPGA in the on-board DDR3 DRAM memory. By storing the inputs in DRAM, multiple external interfaces can access the data independent of the input method. Using the DRAM memory decreases the utilization of FPGA resources in the overall system, allowing faster designs and quicker compile times.

The development board platform communicates with the detector ASIC through an FMC connection. The detector operates on a frame by frame basis. The detector results are converted to boxes, which are drawn over the video data read from the DRAM. The final display is shown over the platform's one output configuration (HDMI Display).

The final detector results are filtered to improve detector accuracy. This filter uses a correlation algorithm that compares the detector results over multiple frames. As the number of consecutive frames a detection appears in increases, so does the confidence of that detection.

Chapter 1 outlines current approaches to object detection, detailing the algorithms tested on this platform. This section further motivates the need for hardware accelerated computer vision algorithms. Additionally, a description of the overall system as completed as part of the SuperUROP project is given. This description includes two major modules: the camera interface, including the preprocessing of the camera data, and the final results display, including both the collection of results and the interface to the HDMI display. Additionally, the interface to BRAM memory and the storing of fixed frames is discussed.

Chapter 2 describes the integration of the off-FPGA DRAM memory with the system. This chapter covers the complexities of designing the DRAM memory interface, including the choice of controller and the methods of interfacing

the modules that communicate with this memory. Additionally, the chapter outlines the challenges of time multiplexing the DRAM's single port between three modules operating at different frequencies. The guards that ensure no data is dropped and that the lowest latency is achieved are also discussed.

Chapter 3 describes the development of a filtering algorithm that correlates the detection results over multiple frames. The filtering is used to both remove false positives and interpolate dropped detections that occur for one to two frames. This process removes errors caused by noisy data sent to the detector, making the detection more robust. Additionally, this chapter details the RTL implementation of this algorithm and includes a discussion of utilization versus performance gain.

Chapter 4 describes the SD Card interface, which is used to transfer pre-recorded data into the system. This interface allows more diverse testing of the ASICs without needing to recompile the system by providing functionality to load different fixed images and full videos into the system post-compile, rather than encoding fixed data into a ROM pre-compile.

1.1 Motivations for Hardware Accelerated Computer Vision Algorithms

Many previous software implementations of computer vision algorithms run primarily on CPU/GPU platforms [4, 10]. The advantages of these systems are standardized camera and display management that are simple to implement and use. The disadvantage is the CPU/GPU's large power consumption and area utilization. For example, the Nvidia 8800 GTX2 uses 185W and takes up an area of 480 mm² at 90nm CMOS [19]. In addition, due to the processing times of these systems, they cannot operate at high resolutions. The CPU and GPU implementations proposed by R. Benenson et al. only work at a resolution of 640 x 480 despite running at high frame rates [2].

Recently, several vision systems have been developed for the FPGA to achieve

higher throughput. These systems tend to use a large amount of on-chip memory, leading to higher hardware costs. The lower end of these implementations is [28], which uses 1.22 Mbit of on-chip memory. Additionally, many systems with lower memory costs only perform single scale detection. Single scale means that detection can only occur at a single, fixed distance from the camera, and people at distances different from the specified distance would not be detected. To operate on real world data, these systems require customized input and output connections in order to achieve the stated performance.

1.2 Discussions of Computer Vision Algorithms Implemented in System

The implementations discussed above use a variety of algorithms. Object detection algorithms have ranged from using cascades of simple features [29] to using edges and chamfer distances [15] to determine the presence of an object in the frame. The Histogram of Oriented Gradients (HOG) method was proposed by [8] and relies on invariant object features, mapped to a set of nine directed gradients, for detection. A detection score is generated using the dot product of the features and classifier weights. The HOG algorithm can be modified for higher performance by using a Deformable Parts Model (DPM), as demonstrated by Felzenszwalb et al. in [12]. DPM creates collections of parts which correspond to pieces of the object. These parts are connected by spring-like connections allowing for deformation. In [13], DPM is shown to increase detection accuracy over single component models such as HOG.

Hardware accelerated implementations of the HOG and DPM algorithms, including ASIC and FPGA, have been tested with this platform. Both implementations process high definition 1920 by 1080 video data.

The HOG object detection implementation, developed by Suleiman and Sze in [26], supports multi-scale objects, meaning objects at a variety of distances can be detected. It achieves an average precision (AP) of 36.9% on the INRIA person data set [16], which is comparable to the original HOG AP of 40% [8]. For a 45nm SOI

process, the HOG implementation reaches 60 fps based on post-layout results. RTL for this HOG design was implemented directly on the FPGA.

The DPM object detection implementation, developed by Suleiman, Zhang, and Sze in [27], supports multi-scale objects that are tolerant to deformation. It achieves an AP of over 80% on the INRIA person data set, and an AP of 26% on the PASCAL VOC 2007 object data set [11]. The DPM implementation reaches 30 fps in a 65nm process and was tested as an ASIC connected to the platform.

1.3 Overall System Design

Figure 1-1: System Block Diagram, showing the movement of data from the camera input to the display output. An additional external input of SD Card data is shown. Boundaries are drawn to show the separate clocking domains within the FPGA system.

The goals for the SuperUROP project were to create a platform to connect the ASIC based detector to the physical world. As discussed above, this platform was built on the VC707 development board with a Virtex-7 FPGA. The external interfaces to the board (Camera¹, HDMI Display², SD Card³, and Detector) transfer different data types at different frequencies. The challenge of creating this platform is converting data from each interface to an intermediate type that can be stored off-FPGA and used by the other modules of the system. The high level block diagram of the overall system is shown in Figure 1-1.

The architecture is divided into four major modules: Camera Preprocessing, Memory Interface, Detector Display, and SD Card. These will be referred to as the camera module, memory module, display module, and SD module respectively. The Detector will be referred to as the detector module. When discussing the processes completed in each module and associated sub-module, a frame signifies a single frame from the video feed; rows signify horizontal rows of pixels; columns signify vertical columns of pixels; and pixels signify the pixels in the individual frame. The clocks for each module are defined within the section discussing that module; the frequencies are shown in Table 1.1.

Within the main modules, each individual sub-module is pipelined to process data at the highest throughput possible with the least latency, given its input and output constraints. Primarily, all sub-modules of a given module operate in the same clock domain. Asynchronous FIFOs are used to communicate between modules as well as add robustness and modularity to the design. As shown in Figure 1-1, the display module performs processing in both clock domains. A double registering process is used to cross between clock domains in this module.

¹ Initial implementation by José E. Cruz Serrallés.
² Initial implementation by José E. Cruz Serrallés.
³ Initial implementation by Luis A. Fernández Lara.

Table 1.1: System Clocks

Clock      Frequency (MHz)
System     200
Camera     63.47
Detector   Varies based on the detector
Display    150
SD         100

Clocks used within the system. The Detector clock depends on the implementation being used. The HOG implementation uses a clock frequency of 103.5 MHz and the DPM ASIC uses a clock frequency of 62.5-125 MHz.

Table 1.2: Module Utilization

Component              LUT              Reg              BRAM             DSP
VC707 Total            303600           607200           1030             2800
System Total           55931 (18.42%)   41344 (6.81%)    252.4 (24.51%)   280 (10.00%)
Camera Preprocessing   2172 (0.71%)     2162 (0.36%)     2 (0.19%)        0 (0%)
Detector Display       19549 (6.44%)    6564 (1.08%)     0 (0%)           124 (4.43%)
Memory Interface       706 (0.23%)      2023 (0.33%)     186 (18.06%)     0 (0%)
Memory Controller*     10741 (3.54%)    7787 (1.28%)     2 (0.19%)        0 (0%)
Filtering              14268 (4.70%)    13686 (2.25%)    3 (0.23%)        73 (2.61%)
SD Card                463 (0.15%)      1011 (0.16%)     12 (1.17%)       0 (0%)

Overall report of utilization of the components of the system. The percentage of total FPGA resources is shown in parentheses. Filtering is not included in Detector Display. *IP core provided by Xilinx.

The total utilization of the current system, as well as details of significant modules, is shown in Table 1.2 with a reference to the total available utilization of the Virtex-7 FPGA.

When designing this system, minimization of on-FPGA BRAM was prioritized. BRAM space on the FPGA is limited, and high on-FPGA BRAM utilization leads to longer routing between components, which increases latency and compile time. The Virtex-7 is a large FPGA, so minimizing utilization of LUTs and registers was not prioritized.

The system input is data from a Vita 2000 1920 by 1080 pixel camera operating at 92 frames per second. The camera data is sent to the FPGA platform over an FMC connection. The camera preprocessing sub-modules transform the data into usable data for the detector and display modules. An additional input mode transfers data stored on an SD card and serializes the data into usable data for the detector and display modules. As shown in Figure 1-1, the processed data is sent to the memory module to be stored in off-chip DRAM. Both the detector module, described in Section 1.2, and the display module read pixels from the DRAM via the memory module. All interfaces to the DRAM store pixel data as 16 24-bit pixel values per 512-bit word. This DRAM usage allows for a modular and pipelined design. The different modules that read and write from the DRAM can be changed as long as the correct pixel representation is maintained on the input and output of the memory module. Using this design, switching between camera and SD card input, or between different detector chips, requires changing the inputs to the DRAM module pre-compile.

The detector module outputs coordinates and sizes, corresponding to regions of the frame where an object is detected. These results are transferred to the display module. The display sub-modules consolidate the results and transform them into pixel coordinates. These pixels are drawn over the real-time video feed and displayed on an HDMI high definition display.
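As an illustration of the pixel packing described above, the Python sketch below packs sixteen 24-bit RGB pixels into one 512-bit DRAM word and unpacks them again. The 32-bit-per-pixel slot layout (24 bits of color plus 8 bits of padding per pixel) and the function names are assumptions for illustration; the thesis only states that 16 24-bit pixels are stored per 512-bit word.

```python
def pack_word(pixels):
    """Pack 16 (r, g, b) pixels, 8 bits per channel, into one 512-bit word.

    Assumes each pixel occupies a 32-bit slot; the slot layout used on the
    actual platform may differ.
    """
    assert len(pixels) == 16
    word = 0
    for i, (r, g, b) in enumerate(pixels):
        pixel24 = (r << 16) | (g << 8) | b     # 24-bit RGB value
        word |= pixel24 << (32 * i)            # place in 32-bit slot i
    return word

def unpack_word(word):
    """Recover the 16 (r, g, b) pixels from a 512-bit word."""
    pixels = []
    for i in range(16):
        pixel24 = (word >> (32 * i)) & 0xFFFFFF
        pixels.append(((pixel24 >> 16) & 0xFF, (pixel24 >> 8) & 0xFF, pixel24 & 0xFF))
    return pixels

# Round-trip check on an arbitrary test pattern.
test = [(i, 255 - i, i * 3 % 256) for i in range(16)]
assert unpack_word(pack_word(test)) == test
```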

1.4 Previous Work - SuperUROP 2015

The following subsections will discuss the design of the Camera Preprocessing module, Memory Interface module, and Detector Display module. These modules were completed as part of the SuperUROP project.

1.4.1 Camera Preprocessing

The Camera Preprocessing module transforms the camera data into the 16-pixel (24 bits per pixel) full color representation stored in the DRAM. Each pixel stores 8 bits each of red, green, and blue color data. The camera has a maximum frame rate of 92 frames per second and outputs data in a Bayer Filter pattern (Figure 1-3.a). In order to maintain a high throughput, the preprocessing could not introduce significant latency into the system. Thus each module is pipelined for maximum efficiency. The block diagram for the sub-modules in the camera module is shown in Figure 1-2. Each module discussed in this section operates at the camera clock frequency (Table 1.1). The total utilization of the camera module is given in Table 1.2.

Figure 1-2: Camera Pipeline Architecture, showing the data flow from the camera input to the memory interface, as well as the location of the BRAM memory.

While the Bayer pattern allows the camera data to be down-sampled and represented with fewer bytes, the gradient image calculation used by the detector cannot be computed correctly with the down-sampled image. Without any preprocessing, the downsampled image, when converted to grayscale, appears as shown in Figure 1-4.a. This image breaks the person outline, and the object detector does not work properly. The image is upsampled using color interpolation, where full 24-bit red, green, and blue (RGB) values are generated for each pixel. This interpolation processes 16 pixels in parallel at a rate of 3 cycles per interpolation. The module which completes these operations is labeled Debayer in Figure 1-2.

RGB Interpolation from Bayer Filter Pattern

The interpolation of the 24-bit RGB pixel value is performed by averaging the red, green, and blue values in the three-by-three matrix surrounding each desired pixel, using the method and algebra described in [23]. Four different three-by-three configurations are possible; each one is shown in Figure 1-3.b. The selection of these matrices depends on both the column and row location of the pixel within the frame. These locations are tracked by incrementing counters for each pixel entering the camera preprocessing module. The averaging calculation is completed in a single clock cycle.

Figure 1-3: a: Bayer Filter Pattern, b: Possible Three-by-Three Matrices; each of the matrices can be mapped to a three-by-three location on the Bayer pattern. Data must be used sequentially or it will incur a memory cost.

One of the challenges of this interpolation is generating the surrounding matrix for each pixel. The averaging at each pixel requires pixels both sequentially before and sequentially after the center pixel to complete the matrix. A Block RAM (BRAM) stores three rows of the frame and is pipelined to continuously update as pixels are moved into the camera preprocessing module. Each BRAM location stores sixteen pixels as represented by the Bayer filter pattern. Therefore, this BRAM has width 128 (16 pixels, 8 bits per pixel) and depth 360 (the number of groups of 16 pixels in three frame rows). To calculate the interpolation for the 16 pixels at a given address a, first, the pixels at location a in the BRAM are read and stored in registers. These represent the center row of the

interpolation matrix. Locations a + 120 and a + 240 are read in the next two cycles, representing the bottom and top rows of the interpolation matrix respectively. The address wraps if it exceeds 360, thus the row being read will wrap. The data output is delayed by a single clock cycle to allow processing of the rightmost pixel in the word of 16 pixels. Each pixel word is processed in four cycles: three clock cycles to read each row and one cycle of delay. Thus, color interpolation has a throughput of sixteen pixels every four clock cycles, or 12 bytes per cycle.

Figure 1-4: a: Down-sampled Grayscale Image, b: Grayscale image after up-sampling; after interpolating the camera data, the effects visible in (a) are no longer present.

When performing the interpolation, the corner cases on the top and left side of the frame require special treatment. On the left side of the frame, at the beginning of each row, values cannot be interpolated until the final BRAM memory access. This delay caused missing and incorrect values for the first two columns in each row. In order to fill in the missing leftmost color values, a special interpolator, capable of operating on a smaller 2x3 matrix, was created and used to generate the missing pixels, along with logic to send the necessary data to these interpolators. To ensure the camera data is stored in the correct location in memory, the reset signals of both the Debayer BRAM and the Memory are synced to the camera frame start signal. The final result is the up-sampled image shown in Figure 1-4.b, without noise effects.
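To make the row-buffer addressing and averaging concrete, the Python sketch below models how the three rows of the interpolation matrix could be fetched from a 360-deep circular buffer and how a 3x3 neighborhood could be averaged per color channel. The helper names, the RGGB parity assignment, and the simple per-channel mean are illustrative assumptions; the actual RTL uses the pixel-position-dependent matrices of [23] shown in Figure 1-3.b.

```python
BRAM_DEPTH = 360   # groups of 16 Bayer pixels covering three 1920-pixel rows
ROW_STRIDE = 120   # 1920 pixels / 16 pixels per BRAM location

def fetch_matrix_rows(bram, addr):
    """Model the three sequential BRAM reads at addresses a, a+120, and a+240,
    with wrap-around at depth 360, giving the top, center, and bottom rows of
    the three-by-three interpolation matrix."""
    center = bram[addr % BRAM_DEPTH]
    bottom = bram[(addr + ROW_STRIDE) % BRAM_DEPTH]
    top = bram[(addr + 2 * ROW_STRIDE) % BRAM_DEPTH]
    return top, center, bottom

def interpolate_rgb(window, row, col):
    """Average the red, green, and blue samples found in a 3x3 Bayer window.

    `window` is a 3x3 list of 8-bit samples centered at frame position
    (row, col); which color each sample carries is decided by its coordinate
    parity, here assumed to follow an RGGB layout.
    """
    sums = {"r": [], "g": [], "b": []}
    for dy in range(3):
        for dx in range(3):
            y, x = row + dy - 1, col + dx - 1
            if y % 2 == 0:
                color = "r" if x % 2 == 0 else "g"
            else:
                color = "g" if x % 2 == 0 else "b"
            sums[color].append(window[dy][dx])
    return tuple(sum(v) // len(v) for v in (sums["r"], sums["g"], sums["b"]))

# Example: interpolate the pixel at an even (red) site of the assumed layout.
window = [[10, 200, 12], [180, 55, 190], [14, 210, 16]]
print(interpolate_rgb(window, row=4, col=4))   # (55, 195, 13)
```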

The Debayer module is pipelined such that it reads data from the BRAM while sending data to be written to the Memory, generating the interface addresses for both simultaneously, as shown in Figure 1-2. Thus, the total latency on the data path from the camera to the memory module is 7 cycles. This delay is absorbed by the memory module, and the full frame appears correctly.

Additionally, if further data compression is necessary, we convert the RGB data to the YUV colorspace. The detector uses grayscale images, and luminance (Y) values give more accurate grayscale values than simply averaging the red, green, and blue values. By discarding the color components (U, V), the amount of data moving through the system is reduced by 1/3.

RGB to YUV Conversion

The RGB to YUV conversion is computed using the full conversion matrix and the following calculations:

    [ Y ]   [  76  150   29 ] [ R ]
    [ U ] = [ -43  -84  127 ] [ G ]        (1.1)
    [ V ]   [ 127 -106  -21 ] [ B ]

    Y_t = (Y + 128) >> 8
    U_t = (U + 128) >> 8                   (1.2)
    V_t = (V + 128) >> 8

    Y_u = Y_t
    U_u = U_t + 128                        (1.3)
    V_u = V_t + 128

The matrix in Equation 1.1 uses whole number calculations to simplify computation in hardware. Equation 1.2 scales the values to 8-bit integers with rounding. Additional offsets are added in Equation 1.3 to eliminate negative values, further decreasing the hardware complexity of the system.
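As a quick software model of Equations 1.1-1.3, the following Python sketch applies the integer matrix, the rounding shift, and the chroma offset to one RGB pixel. It is a behavioral sketch only; the function name and the assumption of full-range 8-bit inputs are illustrative and not taken from the thesis RTL.

```python
def rgb_to_yuv(r, g, b):
    """Fixed-point RGB-to-YUV conversion modeling Equations 1.1-1.3.

    Inputs are 8-bit values; (x + 128) >> 8 performs divide-by-256 with
    rounding, and the final +128 offsets U and V into [0, 255].
    """
    y = 76 * r + 150 * g + 29 * b           # Equation 1.1
    u = -43 * r - 84 * g + 127 * b
    v = 127 * r - 106 * g - 21 * b

    y_t = (y + 128) >> 8                    # Equation 1.2: scale with rounding
    u_t = (u + 128) >> 8
    v_t = (v + 128) >> 8

    return y_t, u_t + 128, v_t + 128        # Equation 1.3: remove negatives

# Example: a mid-gray pixel maps to Y = 128 with neutral chroma.
print(rgb_to_yuv(128, 128, 128))            # (128, 128, 128)
```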

Figure 1-5: Organization of data in frame from Vita 2000 image sensor [21].

This sub-module has a throughput of 16 pixels per cycle. Thus, the total latency of the camera pipeline is increased to 8 cycles. The throughput decreases from 12 bytes per cycle to 8 bytes per cycle.

The camera outputs dummy data at the end of each row and at the end of each frame, while the camera is syncing [21]. Figure 1-5 details the pattern of this dummy data. This dummy data is not marked as valid and is therefore not sent to the memory module.

1.4.2 Memory Interface and Management

The SuperUROP project implemented the system frame storage with the on-FPGA Block RAM (BRAM). The Virtex-7 has about 9MB of BRAM on-chip [33]. In practice, less than 5MB of BRAM could be used in this system due to routing delays caused by high FPGA utilization. The system requires 4MB of BRAM to store two grayscale frames of size 2MB. Two frames must be stored because the BRAM IP cores provided by Xilinx only support up to two ports. One frame buffer is read by the detector at the detector clock frequency; one frame buffer is read by the display at

the display clock frequency (Table 1.1). As mentioned above, a 2MB frame can only store grayscale data. Each BRAM location stores 16 Y values, 8 bits each. Full color frames require 6MB of on-FPGA BRAM, which is beyond the FPGA availability of this system. Thus, in this configuration only grayscale frames can be shown on the final display.

1.4.3 Detector Display

Figure 1-6: Human Understandable Detector Results.

The detector display module transforms the detection results into a human understandable format as shown in Figure 1-6. The block diagram of the sub-modules contained in the display module is shown in Figure 1-7. The results are converted from a box representation to pixel coordinates. These pixels move from the detector clock domain to the display clock domain to be drawn on the frame output. The final results are displayed in real time, providing a more convincing result than simulations

alone.

Figure 1-7: Detector Display module; Note the serialization of multiple detector results.

The display module input is the object box representation generated by the detector and the frame stored in the off-FPGA DRAM. Multiple detection results can be produced in the same cycle to increase the throughput of multi-scale detection. The outputs of these detectors are serialized so each set of results can be processed separately and the display is robust to different detectors. The detector data is converted into pixel data that reflects the size and location of detected boxes in the frame in pixel coordinates. This process is described in detail in Appendix A.2 and takes a single clock cycle operating at the detector clock frequency (Table 1.1). The boxes are consolidated using a non-maximal suppression (NMS) algorithm.

Figure 1-8: a: All Visible Boxes, b: Consolidated Boxes; after using the Non-Maximal Suppression modules the overlapping boxes are consolidated into a single box around each person.

The NMS algorithm used in the architecture compares two boxes at a time to

determine overlap. An algorithm that performs a local maximum search to isolate the areas of interest, NMS has multiple applications in computer vision as discussed by [18]. This system applies NMS in the same manner as [8] to isolate object instances. Without this algorithm, many overlapping boxes would be drawn over the video feed, as shown in Figure 1-8.a.

Non-Maximal Suppression

The Non-Maximal Suppression (NMS) sub-module collects results associated with the same object for an enhanced final display. A naive approach consolidates any boxes with overlap. This method does not work in this implementation, as objects in close proximity may cause overlapping boxes. An example of this behavior is visible in Figure 1-8.b. To account for this effect, the area of intersection of two boxes and the area of the union of two boxes are calculated. These values form a ratio, intersection over union, which is compared to a threshold to determine whether the boxes are associated with the same object. In this system, the threshold is 1/2. The calculation for two boxes, b1 and b2, is shown below:

    iw = max(b1_x0, b2_x0) - min(b1_x1, b2_x1)
    ih = max(b1_y0, b2_y0) - min(b1_y1, b2_y1)
    uw = min(b1_x0, b2_x0) - max(b1_x1, b2_x1)
    uh = min(b1_y0, b2_y0) - max(b1_y1, b2_y1)
    ib = iw * ih
    ua = uw * uh
    ua < (1 / threshold) * ib                    (1.4)

where ua is the union box area and ib is the intersection box area. The calculation in Equation 1.4 is computed using multiplication and comparison, avoiding costly division. Additionally, any box completely contained by another box is immediately characterized by the larger box. These calculations take a single clock cycle to complete.
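A small Python model of this check is sketched below. It mirrors the division-free comparison of Equation 1.4: rather than dividing intersection by union and comparing against the threshold, it compares the union area against a scaled intersection area. The box layout and the boxes_match name are illustrative; the coordinate conventions of the actual RTL may differ.

```python
def boxes_match(b1, b2, threshold_num=1, threshold_den=2):
    """Decide whether two boxes belong to the same object (Equation 1.4).

    Boxes are (x0, y0, x1, y1) with x0 < x1 and y0 < y1. The intersection-
    over-union test ib / ua > threshold is rearranged as
    ua * threshold_num < ib * threshold_den to avoid division, matching the
    multiply-and-compare structure described in the text.
    """
    iw = min(b1[2], b2[2]) - max(b1[0], b2[0])   # intersection width
    ih = min(b1[3], b2[3]) - max(b1[1], b2[1])   # intersection height
    if iw <= 0 or ih <= 0:
        return False                             # boxes do not overlap at all
    uw = max(b1[2], b2[2]) - min(b1[0], b2[0])   # union bounding-box width
    uh = max(b1[3], b2[3]) - min(b1[1], b2[1])   # union bounding-box height
    ib = iw * ih                                 # intersection area
    ua = uw * uh                                 # union (bounding box) area
    return ua * threshold_num < ib * threshold_den

# Two strongly overlapping detections are merged; distant ones are not.
print(boxes_match((10, 10, 60, 110), (15, 12, 65, 115)))    # True
print(boxes_match((10, 10, 60, 110), (200, 10, 250, 110)))  # False
```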

In the current implementation, thirty individual person boxes can be displayed in a single frame. This value is parametrizable and can be adjusted pre-compilation to suit the needs of the application. For the remainder of this section, thirty will be used as the number of boxes.

The comparison of the new box with the stored boxes has two possible results. First, the box overlaps with the stored boxes. In this case, the incoming box is averaged with the stored box, and this average is stored. Second, the box does not overlap with the stored boxes. If all thirty displayed boxes are already filled with an object box, the incoming box is unused. Otherwise, it is added to the collection of thirty displayed boxes. The set of 30 boxes is continuously updated during each frame. A ping-pong architecture is used to parallelize box calculation and box display. This architecture continuously calculates one set of stored boxes, while a different set of stored boxes is displayed. Using this method causes a single frame delay between the results from the detector and the results being displayed. The final results are overlaid on each corresponding frame as shown in Figure 1-8.b.

Ping-Pong Architecture

In order to maintain a correct and consistent display, the boxes drawn to the frame are fully updated before the frame is displayed, using a ping-pong architecture. A block diagram of this architecture is shown in Figure 1-9. After the detector finishes processing a frame, the boxes being calculated are displayed, and a new set of boxes is calculated. Thus, the calculated boxes are not shown on screen until a full frame is processed. Since the updating process and the display process happen simultaneously, there is only a single frame delay between the information from the detector and the information being displayed.
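The double buffering can be modeled in a few lines of Python. The class and method names below are illustrative assumptions; the point is only that one box set is accumulated while the other is displayed, and that the two swap roles when the detector finishes a frame. The merge callback could be the boxes_match predicate sketched above.

```python
class PingPongBoxes:
    """Two box sets: one being updated from detector results, one being drawn."""

    def __init__(self, max_boxes=30):
        self.max_boxes = max_boxes
        self.calc = []     # boxes being accumulated for the current frame
        self.show = []     # boxes being overlaid on the display

    def add_detection(self, box, merge):
        """Merge the new box into an overlapping stored box, or append it."""
        for i, stored in enumerate(self.calc):
            if merge(stored, box):
                self.calc[i] = tuple((a + b) // 2 for a, b in zip(stored, box))
                return
        if len(self.calc) < self.max_boxes:
            self.calc.append(box)    # drop the box if all thirty slots are full

    def end_of_frame(self):
        """The freshly calculated set becomes the displayed set; start a new one."""
        self.show, self.calc = self.calc, []
```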

Figure 1-9: Ping Pong Architecture, demonstrating the architecture in which data is continuously flowing into one box calculation to be used, while the other calculation is fed to the display.

The final box results are transformed into a set of pixel locations. A single color is drawn at each of the locations in the frame to display the boxes. This task requires matching the pixel locations to the correct locations in each frame. The data flow is shown in Figure 1-10. If the location contains a colored pixel, the colored pixel is displayed; otherwise the frame pixel is displayed. The comparison is challenging because the colored pixel calculation is performed at the detector clock frequency and the frame pixels are checked at the display clock frequency. The modules described operate on the detector clock, and the final box pixel values are converted to the display clock domain through double register synchronization. Additionally, false timing paths are declared between the two clocks to satisfy the system timing constraints.

Figure 1-10: Final pixel drawing in Detector Display module; Note the clock domains of each data path. '|' represents bitwise or; the boolean will be False if the box pixel has not been assigned a value.
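A one-line model of this final pixel selection: if the synchronized box layer has assigned a color at the current coordinate, that color wins; otherwise the video pixel passes through. The names are illustrative, and None stands in for the "not assigned" boolean shown in Figure 1-10.

```python
def output_pixel(frame_pixel, box_pixel):
    """Overlay mux: draw the box color where one was assigned, else the frame."""
    return box_pixel if box_pixel is not None else frame_pixel

# Example: a red box pixel overrides the video; an unassigned location does not.
print(output_pixel((12, 34, 56), (255, 0, 0)))  # (255, 0, 0)
print(output_pixel((12, 34, 56), None))         # (12, 34, 56)
```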

Chapter 2

DRAM Architecture and Arbitration

Off-chip DDR3 DRAM is used to decrease the utilization of the on-FPGA memory. When using the DRAM, the system can store and read multiple frames of data, as the DRAM has 1 GB of memory available, in comparison to the 9MB of on-FPGA BRAM. Although DRAM is bigger and can store more data than BRAM, DRAM read requests have more latency and streaming data is more difficult. Removing the BRAM utilization decreases the overall FPGA utilization significantly. Lower FPGA utilization allows shorter routes between components and shorter compile times. Shorter routes decrease the routing delays between components, allowing the system to be clocked faster. The full utilization of the DRAM Memory Interface is shown in Table 1.2. The memory module's utilization is dominated by the IP core provided by Xilinx.

The Memory Interface module directs data from the other modules in the system to the underlying Xilinx Memory Controller. A block diagram of the placement of the memory interface module in the system is shown in Figure 2-1. This module is designed with several layers of abstraction, to be robust to changes to the interfaces that send it requests. The DRAM is a single port device. As such, only one block can send a request to the DRAM each cycle. Thus, the memory module has an arbitration scheme with three phases of operation: write phase, detector read phase, and display read phase.

Figure 2-1: Memory Interface module Block Diagram showing the connection between the three modules corresponding to each external interface and the off-FPGA DRAM.

2.1 Arbitration

The camera, detector, and display blocks must each access the frame data stored in the system. Each of these blocks has a different bandwidth; as such, the DRAM acts as a buffer between these blocks. The DRAM is a single port device, so only one block can send a request to the DRAM each cycle. As the system contains three different modules that must communicate with the DRAM, a form of arbitration is necessary to determine which block's requests are sent to the DRAM during each clock cycle. The memory interface module has three phases of operation: write phase, detector read phase, and display read phase. The phase with the highest bandwidth is given the highest priority; the next highest bandwidth is given the next priority; and the lowest bandwidth is given the lowest priority. Bandwidth is defined in MBPS [megabytes per second] or GBPS [gigabytes per second]. The camera has a bandwidth of 546 MBPS and the display has a bandwidth of 356

MBPS. The detector is assumed to have a bandwidth equal to or less than the display bandwidth, though this depends on the detector ASIC being tested. Thus, the detector is assumed to have the lowest bandwidth in the system. As such, the priorities of DRAM accesses in the system are camera, display, and detector. The required bandwidth for the system is the sum of these values. The maximum required bandwidth is about 1300 MBPS. The bandwidth of the DDR3 on the VC707 development board is 12.5 GBPS [34], and thus this scheme is feasible.

From each module, a stream of requests is generated independently. The camera generates a stream of write requests. The display and detector expect a stream of read responses. The arbitration satisfies the condition that none of these streams are stopped: the camera block can always write data, and data is always available for the display and detector blocks to read. In order to maintain this condition, the camera's write requests and the display's and detector's read responses are stored. The memory interface module sends the camera write requests to DRAM after the number of write requests stored increases above a certain level. This behavior ensures that the camera can continue to make write requests without any request being dropped. The memory interface module sends display or detector read requests to the DRAM after the number of stored read responses decreases below a certain threshold. This behavior ensures that both the display and detector consistently have data available to be read.

2.2 Architecture

The previous section gave a high level overview of the arbitration required to time multiplex between three blocks that operate with different bandwidths. The data from each of these blocks is stored in an asynchronous FIFO of depth 4096. This FIFO both converts the data between different clock domains and makes the memory module robust to the different clock frequencies possible for each interface. The FIFOs are sized to be robust to the different module bandwidths without overflow or underflow. The memory interface module interfaces with the DRAM via a memory

controller provided by Xilinx. The system requires that a complete frame be transferred to the detector from the memory controller at 356 MBPS in order to maintain 60fps performance in the detector and demonstrate real time results. A block diagram of the architecture of the Memory Interface module is shown in Figure 2-2.

Figure 2-2: Memory Interface Module Block Diagram showing three external interfaces as well as the connection to DRAM. The camera clock is 63 MHz, the display clock is 150 MHz, the memory clock is 200 MHz, and the DDR3 clock is 800 MHz. The detector clock depends on the detector being tested.

2.2.1 Memory Interface Module States Control Signals

The memory interface module can be in three different states: write phase, detector read phase, or display read phase, corresponding to the block whose requests are being sent to DRAM. Each of the phases requires different control signals to be sent to the FIFOs and DRAM controller to ensure that the content of each request is correct. The arbitration module assigns these control signals and operates at the Memory

Interface module bandwidth given in Section 2.1. A detailed discussion of the DRAM control signals is provided in Appendix A.1; this section will refer to read and write requests and read responses. A state machine showing the transitions between each phase is shown in Figure 2-3.

Figure 2-3: State machine showing the movement between phases of the arbitration.

In the write phase, the read enable signal for the camera FIFO goes high. The data from the FIFO is broken into address and write data and is sent to the DRAM in a write request. The FIFO uses first-word fall-through, meaning data is available on the same cycle the read enable goes high. In the detector read phase and display read phase, a read request is sent to the DRAM using the internally generated detector read address or display read address. If a valid read response is available, the detector FIFO's or display FIFO's write enable signal goes high and the data is stored in the FIFO.

2.2.2 Response Labeling

A challenge in the arbitration module is determining to which block a response from the DRAM should be sent. The latency between a read request and a read response is variable due to different DRAM access patterns; switching pages, switching between read and write requests, and refreshes of the DRAM will negatively impact throughput [32]. As such, the data cannot be directly linked to a specific request. In order to ensure correct functionality, a protocol counts the number of requests made by each module. The arbitration labels all responses as belonging to one module until the number of responses is equal to the number of requests that module sent previously.

2.2.3 Write Phase and Input

The camera data is addressed to a location in the frame in the camera module. Both this address and the associated data are sent to the camera block. The concatenation of address and data is stored in an asynchronous FIFO. The memory interface module enters the write phase when this FIFO has over 4000 entries. As the arbitration module has a much higher bandwidth than the camera block, the number of entries in the FIFO decreases. The memory interface module will not read the camera FIFO if it contains fewer than 600 entries, because otherwise the FIFO would underflow.

2.2.4 Display Read Phase and Output

As discussed above, the camera data has the highest priority in the arbitration. As such, the memory interface module will always switch to the camera read phase if the write FIFO empties below 600 entries. Otherwise, the memory interface module can enter the display read phase if the display FIFO contains fewer than 4000 entries, and it remains in the display read phase. This behavior maintains the priority of the camera write requests over the display read requests. As read responses are generated by the DRAM and labeled as the display block's, they are stored in the display FIFO. This FIFO is read by the display block once every 16 display clock cycles. As the arbitration module

generates the display read requests, the display FIFO fills at the bandwidth of the arbitration module and empties at the bandwidth of the display block. As the arbitration module has a much higher bandwidth than the display block, the display FIFO fills while the memory module is in the display read phase and will not underflow. Display read requests are only made while the memory interface module is in the display read phase, so the display FIFO does not overflow.

2.2.5 Detector Read Phase and Output

As with the display read phase, the memory interface module will always switch to the camera read phase if the write FIFO empties below 600 entries. Otherwise, the memory module can enter the detector read phase if the detector FIFO has fewer than 4000 entries and the display FIFO has more than 600 entries. The memory module remains in the detector read phase unless the display FIFO empties below 600 entries, which causes a shift to the display read phase. This behavior maintains the priority of camera write requests and display read requests over detector read requests. As read responses are generated by the DRAM and labeled as the detector block's, they are stored in the detector FIFO. As the arbitration module generates the detector read requests, the detector FIFO fills at the bandwidth of the arbitration module and empties at the bandwidth of the detector module. As the arbitration module has a much higher bandwidth than the detector module, the detector FIFO fills while the memory interface module is in the detector read phase and will not underflow. Detector read requests are only made while the memory module is in the detector read phase, so the detector FIFO does not overflow.
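The priority ordering of Sections 2.2.3-2.2.5 can be summarized as a small selection function. The Python sketch below is a simplified behavioral model under the stated FIFO sizes (4096-deep FIFOs with an approximately 4000-entry high mark and 600-entry low mark); the state names, the IDLE fallback, and the exact transition conditions are illustrative assumptions, not the thesis RTL.

```python
HIGH_MARK = 4000   # nearly-full threshold of the 4096-deep FIFOs
LOW_MARK = 600     # nearly-empty threshold

def next_phase(camera_fifo, display_fifo, detector_fifo):
    """Pick which block's requests are sent to the single-port DRAM.

    Arguments are FIFO occupancies: pending camera write requests, and
    buffered read responses for the display and the detector. Camera writes
    take the highest priority, then display reads, then detector reads.
    """
    if camera_fifo > HIGH_MARK:
        return "WRITE"            # drain camera write requests before overflow
    if display_fifo < LOW_MARK:
        return "DISPLAY_READ"     # keep the display supplied with pixels
    if detector_fifo < HIGH_MARK:
        return "DETECTOR_READ"    # prefetch for the detector with spare cycles
    return "IDLE"                 # every stream currently has headroom
```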

2.2.6 FIFO Read and Memory Controller Communication

The DDR3 DRAM memory controller is generated using the Memory Interface Generator (MIG) IP provided by the Vivado development environment [32]. Specifically, the User-Interface (UI) controller [32] was selected due to its level of control of the DRAM signals. A diagram of this controller is shown in Figure 2-4. The UI controller's inputs are a request address, mode of operation, and enable signal. The controller's outputs are a data response and a valid bit. DRAM memory bursts and optimal address ordering are performed internally. Additionally, this controller converts data from the system clock domain (Table 1.1) to the DRAM frequency of 800 MHz. This memory controller directly calls the DRAM and is assumed to function according to Xilinx specifications.

Figure 2-4: UI memory controller and interface with external DRAM [32].

2.2.7 Addressing Time Multiplexed Data

To ensure that no data is overwritten before the desired module accesses the memory location, the frames of video data are stored in different locations in the DRAM memory. After the system begins operation, three separate frames of video data are stored in sequential frame-sized locations in the DRAM. These locations are designated A, B, and C. Which module writes to each location is determined by an internal state machine. Initially, the camera module writes a frame of video data to location A. When the final detector address signals an arbitration reset, the state switches. Now,

Now, the camera module writes a frame to memory location B, followed by the detector module reading a frame from location A. Again the state switches, and the camera module writes to location C, the detector module reads from location B, and the display module reads from location A. This pattern continues for each frame that is written to DRAM. Table 2.1 shows this access pattern. Using this pattern leads to a one-frame delay between the frames read by the detector and the display. Additionally, the pattern precludes a more highly pipelined design in which data could be read as soon as it was written. This design prioritizes correctness over speed: updating the locations only after all three modules have either read or written a complete frame ensures that each frame is complete and has not been modified while being read or written. Additionally, as the memory module operates at the system frequency, the highest frequency possible within the FPGA, the highest possible throughput for this scheme is achieved.

Table 2.1: Module to Memory Location

Module     State 1   State 2   State 3   State 1
Camera     A         B         C         A
Detector   C         A         B         C
Display    B         C         A         B

A table showing the location each module accesses in memory during each state. The state changes when the detector module finishes reading a complete frame. There are three states (1, 2, 3) in the state machine.
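The rotation summarized in Table 2.1 can also be expressed as a small behavioral model. The sketch below is written in Python purely for illustration; it is not the RTL, and the function name and dictionary layout are assumptions made for this example.

```python
# Behavioral sketch of the triple-buffered frame rotation (not the RTL).
# Locations A, B, and C each hold one frame; the assignment rotates every
# time the detector module finishes reading a complete frame.
LOCATIONS = ["A", "B", "C"]

def assignments(state):
    """Return the DRAM location each module uses in the given state (1, 2, or 3)."""
    i = (state - 1) % 3
    return {
        "camera":   LOCATIONS[i],            # written with the newest frame
        "detector": LOCATIONS[(i - 1) % 3],  # reads the frame written one state ago
        "display":  LOCATIONS[(i - 2) % 3],  # reads the frame written two states ago
    }

if __name__ == "__main__":
    for state in (1, 2, 3, 1):
        print(state, assignments(state))
    # State 1 -> camera A, detector C, display B, matching Table 2.1 in steady state.
```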


Chapter 3
Temporal Filtering

Computer vision and object detection algorithms that operate on a frame-by-frame basis, such as the HOG and DPM algorithms developed by [26, 27], are highly susceptible to noise within the frame. This noise can create flashing effects: either a false positive that appears for one to two frames, or a true detection that is lost for one to two frames. In order to eliminate these noise effects, a correlation filter is applied to the boxes produced by the NMS algorithm (Section 1.4). This filter correlates the boxes detected in the previous and future frames to both filter out flashing detections and interpolate dropped detections. This chapter begins with a literature review of relevant correlation algorithms, culminating in the selection of a specific implementation to explore. Experimentation with different parameters in software leads to the final selection of the algorithm to implement in hardware. The hardware implementation is then described and analyzed.

3.1 Algorithm Exploration and Selection

Multiple algorithmic methods exist for correlating data between sequential frames. This section first explores the algorithmic approaches to correlating object detections over multiple frames as presented in the literature. From these, a general class of algorithm is selected for exploration of performance and usability in the system. Increasingly complex iterations on this algorithm were implemented and tested in Matlab.

The algorithm with the best performance is selected for implementation in RTL.

3.1.1 Previous Work

Multiple algorithmic approaches are used to track objects over multiple frames of video data. The review focuses on algorithms that perform tracking by detection, to maintain independence of the detector and filter. Two major classes of tracking-by-detection algorithms exist: correlation between frames, as demonstrated by [1, 6, 24], and particle filters, as demonstrated by [5, 7, 20]. Particle filters require knowledge of the world coordinates and therefore are not used in this implementation. Within the correlation algorithms, several approaches for calculating correlation are explored. Wu et al. propose the following factors for correlation in [30]:

A_{pos}(p_1, p_2) = \gamma_{pos} \exp\left(-\frac{(x_1 - x_2)^2}{\sigma_x^2}\right) \exp\left(-\frac{(y_1 - y_2)^2}{\sigma_y^2}\right)   (3.1)

A_{size}(s_1, s_2) = \gamma_{size} \exp\left(-\frac{(s_1 - s_2)^2}{\sigma_s^2}\right)   (3.2)

p_1, s_1 and p_2, s_2 represent the position and size of the first and second feature being compared, respectively. \gamma_{pos}, \sigma_x^2, \sigma_y^2, \gamma_{size}, and \sigma_s^2 are normalizing coefficients. These factors are used by Shu et al. in the following calculation of the affinity matrix M [25]:

M(i, j) = C(i, j) A_{pos}(i, j) A_{size}(i, j)   (3.3)

C(i, j) is the classifier comparison, which is a comparison of feature points. While these algorithms perform well in software, achieving 71.4% detection precision and 73.5% detection accuracy on the Oxford Town Center Dataset [3], a hardware implementation of the exponential function would have less precision than a floating-point software implementation and would lead to high resource usage, preventing high levels of parallelism.

A less computationally expensive calculation is proposed by Segen and Pingali in [24]:

m = \frac{dx^2 + dy^2}{\sigma_x^2} + \frac{do^2}{\sigma_o^2} + \frac{dc^2}{\sigma_c^2}   (3.4)

where dx and dy are the differences in x and y position, do is the difference in orientation, and dc is the difference in curvature. \sigma_x^2, \sigma_o^2, and \sigma_c^2 are scaling factors. The algorithm achieves a tracking accuracy of 80% on a test sequence gathered by the authors [22]. The correlation calculation uses addition, multiplication, and division. Multiplication and addition are not costly in hardware, and as the division is by a constant, the algorithm can easily be converted to use a multiplicative normalizing factor. Thus, this algorithm is selected for performance analysis within the system.

An additional consideration when using a correlation algorithm is how correlation increases detection confidence. Shu et al. propose that detections must be present over a certain number of frames before they can be considered correct [25]. These detections are considered in correlations prior to this number of frames but are not drawn until they achieve a high enough certainty. Using this method, new objects can be initialized after being present for a certain number of frames. The number of frames is explored in the software exploration.

3.1.2 Base Algorithm and Exploration

To use the algorithm proposed by [24] in the developed system and with the detector outputs, several modifications must be made. First, the detector outputs the position and scale of each object detection. Thus, the orientation and curvature terms are replaced by scale terms, representing the size of the box, giving the following equation:

m = \frac{dx^2 + dy^2}{\sigma_x} + \frac{dx_{scale}^2 + dy_{scale}^2}{\sigma_{scale}}   (3.5)

where, with subscript 1 denoting the first box of the comparison and subscript 2 the second,

dx = x_1 - x_2
dy = y_1 - y_2
dx_{scale} = xscale_1 - xscale_2
dy_{scale} = yscale_1 - yscale_2

The scaling factors were selected such that each term equals 1 when the values have the maximum difference: \sigma_x = 1920^2 + 1080^2 and \sigma_{scale} = (443 - 64)^2 + (886 - 128)^2, encompassing the squaring terms of Equation 3.4. Finally, correlation Equation 3.5 is modified to convert the divisions into multiplications, as shown in the final equation:

\sigma_x \sigma_{scale} m = \sigma_{scale}(dx^2 + dy^2) + \sigma_x(dx_{scale}^2 + dy_{scale}^2)   (3.6)

The threshold value for this comparison is 0.5 \cdot 10^{14} \cdot \sigma_x \cdot \sigma_{scale} and was determined empirically. As shown in [25], using additional information from the detection can increase the confidence of the correlation. The use of the detection score, a measure of detection confidence defined in Section 1.2, is explored in the software simulations. Also presented in [25] is the notion of a detection being correct only after appearing in a certain number of frames. This number of frames and their weighting are explored as well.

3.1.3 Performance Measurement

The software-based filtering algorithm's output performance was benchmarked against the KITTI Vision Road dataset [17]. The performance of each iteration of the algorithm was compared to the performance of the detector alone on the same data set. The two metrics considered are precision and recall, which are calculated using the following equations:

\text{precision} = \frac{\text{number of correct detections}}{\text{total boxes drawn}}

\text{recall} = \frac{\text{number of correct detections}}{\text{number of boxes in the benchmark}}

A box is counted as correct if it overlaps with a benchmark box by at least 50%. The results of running the detector only on one video of the KITTI dataset are shown in Table 3.1. Ideally, the filter increases the precision while maintaining a nearly constant recall in comparison to the base detection algorithm.

3.1.4 Position Correlation

The simplest form of correlation is to compute the correlation between boxes in the current and previous frame. Each box in the current frame is compared to all boxes in the previous frame using Equation 3.6. The lowest value for each box in the current frame is its correlation score; a minimum score of zero implies the same box occurred in both frames. If the correlation score associated with the box is below the threshold, the box is drawn. The results of running this algorithm are shown in Table 3.1. The precision does increase, but with a decrease in recall.

3.1.5 Detection Score Thresholding

In addition to thresholding on the correlation score, the detection scores of the current and previous boxes can be used as an additional threshold. For the pair of boxes with the lowest correlation score, the two detection scores are averaged. As with the correlation score, if the average detection score is below a score threshold, the box from the current frame is drawn. A correlation cannot be computed between the current and previous scores, as the scores do not differ greatly between different objects. The results of running this algorithm are shown in Table 3.1. With detection score thresholding, there is a small precision increase and a greater decrease in recall. This effect confirms that the detection score does not differentiate between objects with high accuracy.

3.1.6 Multiple Frame Correlation

The previous two algorithms only required the object to be present in a single previous frame for the object to be considered correct. As discussed in Section 3.1.1, persistence over more frames corresponds to higher confidence. Multiple frame correlation compares the current frame to the two previous frames. In this implementation, a box is assumed to be correct (not noise) if it persists for more than two frames. The correlations to the two previous frames are performed as two single-frame correlations: one between the current frame and the previous frame, and a second between the current frame and the second previous frame. The results of these two correlations are combined by weighting the previous-frame correlation by 0.75 and the second-previous-frame correlation by 0.25. This combined value is used in the threshold comparison to determine whether the box is drawn. The results of running this algorithm are shown in Table 3.1. Multiple frame correlation creates a more significant precision increase than using a single frame for correlation, but a slightly greater decrease in recall.

3.1.7 Forward Backward Correlation

The previous algorithms are designed to eliminate noisy boxes from the detection. In order to interpolate boxes that have been dropped by the detection, correlations using a future frame are added. Thus, there are four correlations performed in total, as shown in Figure 3-1. First, all boxes in the current frame are checked, and the filtered boxes are stored. Then, the future frame's boxes are checked; if a detected box passes the correlation threshold and does not match any of the previously stored filtered boxes in the current frame, the box is added to the collection of filtered boxes.
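To make the arithmetic concrete, the following Python sketch evaluates the scaled correlation score of Equation 3.6 and the 0.75/0.25 weighted combination over two previous frames. It is a software illustration rather than the Matlab or RTL implementation; the box tuple layout and the helper names (correlation, best_match, two_frame_score) are assumptions made for this example.

```python
# Software sketch of the correlation score (Equation 3.6) and the weighted
# two-frame combination; boxes are (x, y, xscale, yscale) tuples.
SIGMA_X = 1920**2 + 1080**2                   # position scaling factor
SIGMA_SCALE = (443 - 64)**2 + (886 - 128)**2  # size scaling factor

def correlation(box_a, box_b):
    """Scaled correlation score, i.e. sigma_x * sigma_scale * m from Equation 3.6."""
    dx, dy = box_a[0] - box_b[0], box_a[1] - box_b[1]
    dxs, dys = box_a[2] - box_b[2], box_a[3] - box_b[3]
    return SIGMA_SCALE * (dx**2 + dy**2) + SIGMA_X * (dxs**2 + dys**2)

def best_match(box, frame_boxes):
    """Lowest correlation score of `box` against all boxes in another frame."""
    return min(correlation(box, other) for other in frame_boxes) if frame_boxes else None

def two_frame_score(box, prev_boxes, prev_prev_boxes):
    """Weight the previous frame by 0.75 and the second previous frame by 0.25."""
    s1 = best_match(box, prev_boxes)
    s2 = best_match(box, prev_prev_boxes)
    if s1 is None or s2 is None:
        return None
    return 0.75 * s1 + 0.25 * s2
```

The resulting score is then compared against the empirically chosen threshold to decide whether the box is drawn.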

Figure 3-1: Diagram showing the forward-backward correlation scheme; the red lines indicate correlations between the current frame and the previous two frames, and the blue lines indicate correlations between the future frame and the previous two frames.

The boxes for each frame cannot be accessed until the frame has been processed by the detector, and each frame is processed sequentially. In order to access the future frame, the system delays the results by one frame: the results from the current frame serve as the future boxes, while the results from the previous frame serve as the current results. The results of running this algorithm are shown in Table 3.1. Forward backward correlation gives a higher recall while increasing precision in comparison to the detector-only results.

3.1.8 Results

From the results of the simulated algorithms, a forward backward algorithm using Equation 3.6 to generate a correlation score over multiple frames was chosen for hardware implementation. Correlating over multiple frames gave the greatest increase in precision of all the algorithms, and the forward backward step prevents a significant decrease in recall.
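The forward-backward pass of Section 3.1.7 can be summarized in a few lines of Python. This is only a sketch of the control flow: `passes` is assumed to wrap the two-frame correlation and threshold test above, and `overlaps_any` is an assumed helper that checks whether a future box matches a box already kept for the current frame.

```python
# Sketch of the forward-backward pass for one frame (software model, not RTL).
def forward_backward(current, future, prev1, prev2, passes, overlaps_any):
    # Keep current-frame boxes that correlate with the two previous frames.
    kept = [box for box in current if passes(box, prev1, prev2)]
    for box in future:
        # Interpolate a dropped detection: the future box correlates with the
        # previous frames but matches nothing already kept for the current frame.
        if passes(box, prev1, prev2) and not overlaps_any(box, kept):
            kept.append(box)
    return kept
```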

Table 3.1: Filtering Test Results

Algorithm                  Detected  Benchmark  Missed  Precision  Recall
Detector Only              765       576        335     0.315      0.418
Single Frame, Size only    577       577        346     0.400      0.400
Single Frame, with Score   546       577        374     0.372      0.352
Single Frame, Size, FB     735       577        332     0.333      0.425
Single Frame, Score, FB    725       577        342     0.324      0.407
Two Frames, Size only      542       577        349     0.421      0.395
Two Frames, with Score     516       577        375     0.391      0.350
Two Frames, Size, FB       675       577        336     0.386      0.418
Two Frames, Score, FB      667       577        344     0.349      0.403

A table detailing the performance of each algorithm on the KITTI dataset. FB stands for forward backward. The first three columns give the totals of boxes detected, boxes in the benchmark, and boxes missed by the filter over a video. The Precision and Recall columns are calculated by the methods described above.

A visual comparison of the detection results and the filtered detection results in a single frame is shown in Figure 3-2. In the second row, a false positive caused by noise has been removed while correct boxes remain. The interpolation of dropped boxes is illustrated in Figure 3-3. In the second row, a box dropped by the detection has been interpolated by the filter.
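The precision and recall columns of Table 3.1 follow directly from the counts in the first three columns if the number of correct detections is taken as the benchmark count minus the missed count. The helper below is a sketch of that bookkeeping, not the evaluation script used to produce the table.

```python
# Sketch of the precision/recall bookkeeping behind Table 3.1, assuming
# correct detections = benchmark boxes - missed boxes.
def precision_recall(detected, benchmark, missed):
    correct = benchmark - missed
    precision = correct / detected   # correct detections / total boxes drawn
    recall = correct / benchmark     # correct detections / boxes in benchmark
    return round(precision, 3), round(recall, 3)

# Example: the "Detector Only" row of Table 3.1.
print(precision_recall(765, 576, 335))  # -> (0.315, 0.418)
```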

Figure 3-2: Comparison of results, highlighting false positive removal. Each image in a row shows the same frame. The first column is the detection results, the second the filtered results, and the third the ground truth.

Figure 3-3: Comparison of results, highlighting an interpolated frame. Each image in a row shows the same frame. The first column is the detection results, the second the filtered results, and the third the ground truth.

Figure 3-4 shows the Average Precision (AP) curves of the filtering algorithm versus the detector only (blue and red curves, respectively). These curves are generated by sweeping the detection threshold from a high to a low level. The correlation threshold must be adjusted for each KITTI video. As shown in the plot, for lower thresholds the filter has a higher precision than the detector alone, as false positives are removed from the detection results. At high thresholds, the filtering sees worse performance: sparse detection results cannot be correlated, and the detector does not produce noisy results. The optimal threshold for high recall values occurs at lower thresholds, so the filter is effective over the optimal threshold values; above these thresholds, it is assumed that filtering is unnecessary. Considering only the values with recall > 0.5, the filtering increases the average precision by 6.23% on a single video of the KITTI data set.

Figure 3-4: Diagram showing the Average Precision curves of both the detected and filtered results for a single KITTI road video. For higher recall values, the filter demonstrates higher precision. Note the line at recall = 0.5.
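The restricted average-precision comparison can be approximated with a small helper. The sketch below assumes each curve is a list of (recall, precision) points obtained by sweeping the detection threshold, and that the comparison simply averages precision over the points with recall above 0.5; the exact evaluation used to produce the 6.23% figure may differ.

```python
# Sketch of an average-precision comparison restricted to recall > 0.5.
def average_precision(curve, min_recall=0.5):
    """Mean precision over the (recall, precision) points with recall above min_recall."""
    points = [p for r, p in curve if r > min_recall]
    return sum(points) / len(points) if points else 0.0

def improvement(filtered_curve, detector_curve):
    """Difference in restricted AP between the filtered and detector-only curves."""
    return average_precision(filtered_curve) - average_precision(detector_curve)
```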

3.2 RTL Implementation

The filter architecture uses a pipelined approach in which four boxes are processed in each cycle. The new boxes are loaded from the final display boxes produced by the NMS sub-module. The boxes are selected from each of the frames as shown in Figure 3-1. The filter is divided into three pipelined phases: box loading and selection, correlation calculation and threshold comparison, and serialization and modified NMS. Pipelining occurs both between each of these blocks and within the blocks themselves to meet the timing requirements of the system. A block diagram of the overall filter design is shown in Figure 3-5.

Figure 3-5: High-level block diagram showing the major filtering sub-modules. Pipelining occurs between each block.

3.2.1 Box Loading and Selection

After the detector module fully processes a frame, the results of the previous frame's NMS are available for the filtering algorithm. Using a shift register architecture, the new boxes are moved into the first location, and all other entries are pushed back by one, with the final entry being discarded. The shift register is used over a FIFO or BRAM structure for the ability to access multiple elements in the same cycle. The values in this shift register are then stored in register array structures for more direct access to the individual boxes. The loading phase has a latency of two cycles: one for the shift register update and one for the register array update.
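A behavioral model of the box loading step is sketched below. For brevity it models the shift register at frame granularity and assumes four stored frames (future, current, previous, and second previous); the hardware shifts individual box entries and copies them into register arrays, so the class and constant names here are purely illustrative.

```python
from collections import deque

# Behavioral model of the box-loading shift register (not the RTL).
NUM_FRAMES_STORED = 4   # assumed: future, current, previous, second previous

class BoxStore:
    def __init__(self):
        self.frames = deque(([] for _ in range(NUM_FRAMES_STORED)),
                            maxlen=NUM_FRAMES_STORED)
        self.register_array = [[] for _ in range(NUM_FRAMES_STORED)]

    def load_frame(self, new_boxes):
        """Shift in the newest frame's boxes; the oldest frame is discarded."""
        self.frames.appendleft(list(new_boxes))
        # In hardware a second cycle copies the shift register into register
        # arrays so that individual boxes can be indexed directly.
        self.register_array = [list(f) for f in self.frames]
```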

Figure 3-6: Block diagram showing the box loading and selection process.

To select the set of boxes from each frame, two internal indices are incremented, representing the box to select from each frame. index_1 is the box index for the current and future frames; index_2 is the box index for the two previous frames. Accessing the two previous frames with the same index each cycle, and the current and future frames with the same index each cycle, creates the highest level of parallelization and requires the minimum number of cycles to assess all combinations of boxes. There are (number of boxes)^2 combinations possible between two frames. Figure 3-6 demonstrates the complete process.

3.2.2 Correlation Calculation and Threshold Comparison

The block diagram of the correlation calculation and thresholding pipeline is shown in Figure 3-7. For each combination of index_1 and index_2, four correlations are computed, as shown in Figure 3-1. These correlations are computed in parallel using Equation 3.6, and the computation is pushed to DSP blocks during compilation.

The output of this correlation is a 45-bit integer; the bit width is increased due to the multiplicative normalization factors. The correlations against the two previous frames are combined with a weight of 0.75 on the first previous frame and 0.25 on the second previous frame. These computations also require 45-bit integers. The final result is compared to the threshold defined in Section 3.1.2, and the boolean result of the comparison is stored in a register. If the boolean is true, the current box is stored in a register; otherwise, zero is stored for the box data. The latency of these blocks is three cycles, as shown in the block diagram in Figure 3-7.

Figure 3-7: Block diagram showing the pipelined correlation calculation and threshold comparison. The current box is used as an example.

3.2.3 Serialization and Modified NMS

The four correlation results are serialized, stored in a FIFO of size 1024, and compared using a modified Non-Maximal Suppression module. This pipeline is shown in Figure 3-8. The FIFO is generated using an IP core provided by Xilinx, which has a minimum size of 1024. 1024 entries is oversized for the current filtering application, but utilizing the FPGA's internal FIFOs is a better resource utilization strategy. The modified Non-Maximal Suppression algorithm still takes the new box and the previously stored final boxes as input.

final boxes as input. It is modified to only output an overlap threshold. If the boxes do not overlap, the new box is added to the final boxes. Figure 3-8: Block diagram showing the pipelined correlation calculation and threshold comparison. The maximum latency of this module is the maximum number of boxes stored in the FIFO, which is (number future frames + 1) * number of boxes 2, plus one cycle to store the results of NMS2. In practice, as the boxes are not written to the FIFO if neither the future nor the current box is correct, this latency can be much lower. 3.2.4 Parametrization While the parameters empirically determined for the filtering algorithm give the system adequate performance, future iterations of this sub-module can be further optimized. Considering future work, the filter module has been designed to allow a high level of parametrization pre-compile. Parameter controlled are: number of boxes per frame, number of future frames, number of previous frames, and correlation thresholds. In the current implementation, nb = 30, ff = 1, and bf = 2, where nb is the number of boxes per frame, ff is the number of future frames, and bf is the number of back frames considered. 3.2.5 Overall Performance The filtering begins when the NMS calculation completes as discussed in Section 1.4. As such, the filtered boxes are not available to be shown at the beginning of each frame. Summing the latency of each sub-modules gives the following equation: 56

latency = T(load boxes) + T(correlation, comparison) + T(serialization, NMS2)
        = 2 + (3 + nb^2) + ((ff + 1) * nb^2 + 1)
        = O(nb^2 * ff)

where nb is the number of boxes per frame and ff is the number of future frames considered. In the current implementation, nb = 30 and ff = 1, so latency = 1800 cycles. The filtering module uses the detector clock, so the latency is 18 μs. As each frame is displayed for 33.3 milliseconds, this latency is acceptable. The latency grows as O(nb^2) with the number of boxes in each frame and as O(ff) with the number of future frames. Additionally, as mentioned above, each future frame causes the detected results to be displayed one frame later than the image.

The total utilization of the filter sub-module is 14268 LUTs (4.7%), 13686 registers (2.25%), 3 36k BRAM tiles (0.29%), and 73 DSPs (2.61%). The high register and LUT usage results from the storage and access of box data. Additionally, the large bit widths used for the correlation and comparison calculations also increase register usage. This sub-module prioritizes high throughput and increased detection performance over resource utilization. For a 34% increase in LUTs, a 49% increase in registers, a 1.2% increase in BRAM, and a 35% increase in DSPs, the system's detection accuracy increases by 6.23%.


Chapter 4
Pre-Recorded Data Load

In addition to the live video camera input, the testing system can either be compiled with ROMs storing static frames or transfer pre-recorded data into the system via the SD card. These interfaces are necessary to better test and benchmark the different detection algorithms being evaluated. The ROMs replace the BRAM memory described in Section 1.4. The SD card interacts with either the BRAM memory or the DRAM memory described in Chapter 2. By routing all data entering and leaving the system through the memory systems, the system becomes robust to the differing transfer rates and latencies associated with the different modules.

4.1 On-FPGA ROM Static Image Storage

In order to test the first iteration of the system with pre-recorded data, ROMs with pre-loaded grayscale static images replaced the BRAM and were compiled with the design. Testing the detector chips on static images provided an initial verification of algorithm correctness. The downside of using pre-loaded ROMs is that the images must be loaded pre-compile and cannot be changed post-compile; multiple compilations are required to test multiple images. Color images cannot be stored, as the FPGA resources are limited.

4.2 SD Card

In order to load pre-recorded data to the FPGA post-compile, a method of transferring data into the system was implemented. Currently, the SD card is used for this transfer. The SD card loads either single frames, which can be cycled through sequentially, or full videos. The single-frame incremental load interfaces with the BRAM and uses grayscale images; the video load interfaces with the DRAM and uses full color data.

4.2.1 SD Card Control

The SD card interface is a modified version of the interface developed by Luis Fernádez in [14]. An open source VHDL control module is used for SPI control of the SD card [31]. This module operates at a maximum clock frequency of 12.5 MHz and has a transfer rate of 1.5 MB/s. The signals that control this module are generated by a state machine, which performs a handshaking procedure to ensure correct data transfer between the SD card and the FPGA; details are covered in [14].

The state machine has three states: IDLE, READEIGHTBYTES, and MEMREADY. The state machine is initially in IDLE. The SD card read starts on a user input. The state machine then changes to READEIGHTBYTES and concatenates 1 byte of output from the SD control module with the previously received bytes. After receiving 8 bytes, the state machine enters the MEMREADY state, where it waits for an acknowledgment from the upper-level module. After the acknowledgment is received, the state machine re-enters the READEIGHTBYTES state. The number of bytes to read from the SD card is calculated by multiplying the number of bytes in a frame by the number of frames stored on the SD card; both of these values are parameters that can be adjusted pre-compile. When the maximum number of bytes has been read, the state machine enters the IDLE state. Figure 4-1 shows a visual flow of this state machine.
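A behavioral model of this state machine is sketched below in Python (not the VHDL). The state names follow the text, while the class name, signal names, and byte handling are assumptions made for illustration.

```python
# Behavioral sketch of the SD card read state machine (not the VHDL module).
IDLE, READEIGHTBYTES, MEMREADY = "IDLE", "READEIGHTBYTES", "MEMREADY"

class SdReader:
    def __init__(self, bytes_per_frame, num_frames):
        self.max_bytes = bytes_per_frame * num_frames  # total bytes to read
        self.state = IDLE
        self.word = b""          # 8-byte word being assembled
        self.bytes_read = 0

    def step(self, start, sd_byte, ack):
        if self.state == IDLE and start:
            self.state = READEIGHTBYTES
        elif self.state == READEIGHTBYTES:
            self.word += sd_byte             # concatenate one byte from the SD module
            self.bytes_read += 1
            if len(self.word) == 8:
                self.state = MEMREADY        # wait for the upper-level acknowledgment
        elif self.state == MEMREADY and ack:
            self.word = b""
            self.state = IDLE if self.bytes_read >= self.max_bytes else READEIGHTBYTES
```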

Figure 4-1: SD card controller state machine for video reading

Six outputs are concatenated into a 178-bit-wide word, which is zero-extended to 512 bits. This word holds 16 24-bit full-color pixels that can be used by other modules of the system. Each word is assigned an address and stored in a FIFO of depth 1024. This FIFO pipelines the transfer between the external interface and the FPGA system.

4.2.2 Single Frame Load

The SD card can be used to transfer multiple static images, which can be sequentially accessed with a button push. Loading image data onto the SD card is a much faster testing process than recompiling the design, and multiple static images can be tested in the same test. To load single frames with a button push, a fourth state, PAUSE, is added to the state machine. After the number of bytes in a frame has been read, the state machine enters the PAUSE state until the user presses the button. Then, the state machine enters the READEIGHTBYTES state. The button press also resets the address counter to ensure the first data of the new frame is written to the first address. The updated state machine is shown in Figure 4-2.

Currently, this reading functionality is integrated with the on-chip BRAM memory.

Figure 4-2: SD card controller state machine with the PAUSE state included

4.2.3 Video Load

To load a full video, the initial SD state machine remains the same. The address continually increments until it reaches the maximum frame address, at which point it resets. The SD card is read until the full video is loaded, and the number of frames of video data can be updated pre-compile. 1 GB of DRAM can hold 166 6 MB frames; in the current system, a maximum of 131 frames can be loaded to DRAM due to address constraints. The transfer rate of 1.5 MB/s from the SD card results in a frame transfer rate of 0.25 FPS. As this frame rate would severely bottleneck the system, the entire video is loaded into DRAM off-line and then read by the detector and display modules. This process requires a different DRAM arbitration scheme; all interfaces to the memory module remain the same.
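The storage and transfer figures quoted above follow from simple arithmetic. The sketch below reproduces them under the stated assumptions (roughly 6 MB per full-color 1920x1080 frame, a 1.5 MB/s SD transfer rate, and 1 GB of DRAM taken as 1000 MB to match the 166-frame figure).

```python
# Back-of-the-envelope check of the video-load figures quoted above.
FRAME_MB = 6        # approximate size of one 1920x1080, 24-bit frame
SD_RATE_MBPS = 1.5  # SD card transfer rate in MB/s
DRAM_MB = 1000      # 1 GB of DRAM, taken as 1000 MB here

frames_in_dram = DRAM_MB // FRAME_MB   # -> 166 frames of storage
load_fps = SD_RATE_MBPS / FRAME_MB     # -> 0.25 frames per second over the SD link
print(frames_in_dram, load_fps)
```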