Leveraging Reconfigurability to Raise Productivity in FPGA Functional Debug

Abstract

We propose new hardware and software techniques for FPGA functional debug that leverage the inherent reconfigurability of the FPGA fabric to reduce functional debugging time. Traditionally, the functionality of an FPGA circuit is represented by a programming bitstream that specifies the configuration of the FPGA's internal logic and routing. The proposed methodology allows different sets of design internal signals to be traced solely by changes to the programming bitstream, followed by device reconfiguration and hardware execution. The advantage of this methodology over existing debug techniques is that it operates without iterative executions of the computationally intensive design re-synthesis, placement and routing tools. In essence, with a single execution of the synthesis flow, the new approach permits a large number of the design's internal signals to be traced for an arbitrary number of clock cycles using a limited number of external pins. Experimental results using commercial FPGA vendor tools demonstrate productivity (i.e., run-time) improvements of up to 25x vs. a conventional approach to FPGA functional debugging. These results demonstrate the practicality and effectiveness of the proposed approach.

I. INTRODUCTION

As the cost of state-of-the-art ASIC design continues to escalate, field-programmable gate arrays (FPGAs) have become widely used platforms for digital circuit implementation. FPGAs carry several advantages over ASICs, including reconfigurability and lower NRE costs for mid-to-high volume applications. While there remains a gap between FPGAs and ASICs in terms of circuit speed, power and logic density [1], innovations in FPGA architecture, circuits and CAD tools have produced steady improvements on all of these fronts. Today, FPGAs are a viable target technology for all but the highest volume or low-power applications.
The reconfigurability property of FPGAs reduces the cost associated with fixing the various functional errors that can occur during the design cycle. In fact, reconfigurability changes the way that design verification is done in FPGAs compared with ASICs. With ASICs, the high costs of mask changes and silicon re-spins (steppings) imply that designers spend considerable time in simulation/verification before tape-out, including, for example, simulation with post-layout extracted capacitances and cross-talk noise analysis. Conversely, with FPGAs, designers rarely do post-routing full delay simulations, which are quite compute-intensive. Instead, reconfigurability allows design iterations to include actual silicon execution. Designers verify their design in hardware using the same (or a similar) FPGA they intend to deploy in the field. When design errors are discovered, the design's RTL is altered and RTL simulation may or may not be performed. This is followed by re-synthesis and executing the modified design in hardware. The time needed for design cycles in the ASIC domain is dominated by post-layout simulation and verification, whereas in FPGAs, design cycles are dominated by re-synthesis (logic synthesis, technology mapping, placement and routing) tool run-times. FPGA placement and routing can take hours or days for the largest designs [2], and such run-times are an impediment to designer productivity.

With this observation in mind, in this paper we present new techniques for FPGA functional debug that exploit reconfigurability to raise productivity by reducing the number of compute-intensive design re-synthesis runs that are needed. At a high level, our approaches work as follows. Say, for example, an engineer wishes to trace a large number, N, of a design's internal signals during functional debug, using a small number of available external pins, m (N >> m).
We augment the design with additional circuitry that allows the N signals to be traced with N/m FPGA device reconfigurations and hardware executions. The key value of our approach is that the design is only synthesized, placed and routed once, rather than N/m times. This is achieved by selecting the different sets of m trace signals through modifications to the FPGA's configuration bitstream (i.e., the post-routed design).

Fig. 1. FPGA hardware structures: (a) FPGA logic structures; (b) routing structures.

While all of the proposed approaches leverage reconfigurability to reduce loops through the design process, we present a number of design variants that are desirable in different scenarios, e.g., with different numbers of external pins being available for debugging, and with different availabilities of internal FPGA resources, such as block RAMs. A further contribution of this work is a new multiplexer (MUX) design scheme for FPGAs that uses significantly less area than a traditional MUX design. The new MUX is suitable for use in cases wherein the MUX select inputs are changed using the FPGA bitstream, instead of using normally routed logic signals. As compared with design re-synthesis for each group of m signals, experimental results demonstrate that our approaches improve run-time by up to 25x. They also offer stability in the timing characteristics of the circuit being debugged.

The remainder of this paper is organized as follows. Section II reviews background on FPGA architecture and related work on FPGA functional debug. The proposed approach to debugging is described in Section III. Section IV discusses various architectures to meet different resource constraints. Section V provides experimental results. Conclusions and suggestions for future work are offered in Section VI.

II. BACKGROUND

A. FPGA Architecture

An FPGA is a two-dimensional array of programmable logic blocks and a configurable routing network. Combinational logic functions in FPGAs are implemented using K-input look-up-tables (LUTs), which are small memories capable of implementing any logic function of up to K variables. As shown in Fig. 1(a), each LUT in an FPGA logic block is normally coupled with a flip-flop, which can optionally be bypassed. SRAM configuration cells are programmed to specify the truth table of the logic function implemented by the LUT, as well as to control the flip-flop bypass MUX.

Fig. 1(b) shows a simplified view of a programmable routing structure. The inputs to the MUX attach to logic block output pins or routing conductors in the FPGA device (metal wire segments). The output of the buffer can drive a routing conductor or a logic block input. Again, SRAM configuration cells drive the select inputs on the MUX, and the SRAM values specify a particular input whose signal is driven through the buffer.

Fig. 1 is intended to illustrate that the logic functionality and routing connectivity of an FPGA depend entirely on values in the programming bitstream that is shifted into the FPGA's SRAM configuration cells (which are connected in a scan chain). The programming bitstream also specifies the initial value (logic-0 or logic-1) for each flip-flop in the device. Our approaches to FPGA functional debug rely on making changes to the programming bitstream, without having to re-run time-consuming FPGA synthesis, place and route tools.
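To make the role of the configuration cells concrete, the following Python sketch models a K-input LUT as a small memory addressed by its inputs. It is only an illustrative model: the class `LUT`, the helper `truth_table`, and the bit ordering of the address are assumptions made for this example and do not correspond to any vendor's bitstream format.

```python
class LUT:
    """Toy model of a K-input LUT: 2**K configuration bits hold the truth table."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a K-input LUT needs exactly 2**K configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    def evaluate(self, inputs):
        # The K input values form an address into the configuration memory.
        if len(inputs) != self.k:
            raise ValueError("expected one value per LUT input")
        address = 0
        for position, bit in enumerate(inputs):
            address |= (bit & 1) << position
        return self.config_bits[address]


def truth_table(k, func):
    """Enumerate the 2**K configuration bits for a K-input Boolean function."""
    bits = []
    for address in range(2 ** k):
        inputs = [(address >> position) & 1 for position in range(k)]
        bits.append(func(*inputs) & 1)
    return bits


# Program the LUT as a 3-input AND and evaluate it.
lut = LUT(3, truth_table(3, lambda a, b, c: a & b & c))
and_result = lut.evaluate([1, 0, 0])

# "Reconfigure": same wiring, new truth table, different logic function (XOR).
lut.config_bits = truth_table(3, lambda a, b, c: a ^ b ^ c)
xor_result = lut.evaluate([1, 0, 0])
```

The point of the model is that the function computed by the LUT changes when only the stored bits change, while the connections to its inputs and output stay where they are.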

Fig. 2. Conventional and proposed FPGA design process: (a) conventional design process; (b) proposed design process.

Fig. 3. Area overhead of SignalTap (ALMs + registers vs. number of traced nodes).

B. FPGA Functional Debug

There are two major approaches to performing functional debug with an FPGA. The first approach is to implement the complete design in an FPGA device. This is suitable for small designs that do not need to be executed at high frequency. Because of the reconfigurability, debugging modules can be easily added or modified at no cost. Lagadec and Picard present a set of circuit modifications that enhance debug capabilities [3]. Those modifications provide software-like debug features, such as hardwired watchpoints, step-by-step execution and hardwired breakpoints. However, each time watchpoints or breakpoints change, designs need to be recompiled, which is a run-time-intensive task.

In a somewhat similar manner to what is proposed in this work, Graham et al. improve debugging productivity by instrumenting FPGA bitstreams [4]. An embedded logic analyzer is inserted into the design without connecting to any signals. After place-and-route, the signals targeted for tracing are routed to the logic analyzer. This is done by modifying bitstreams using vendor tools. Although the approach provides more flexibility in choosing the desired internal signals for tracing, it remains a very complicated procedure. In fact, the tools relied upon in [4] (Xilinx JBits) are no longer supported for modern FPGAs. Furthermore, re-routing needs to be performed when different sets of signals are selected for tracing. This means that the speed performance of the design can change significantly each time.
The second approach to using FPGAs for functional debug is embedding reconfigurable logic into SoCs to enhance debug capability [5], [6]. Using its programmability, the reconfigurable logic can implement various debug paradigms, such as assertions, signal capture and what-if analysis. Those paradigms help engineers to understand the internal behavior of the chip and provide at-speed in-system debug. Engineers can instrument the reconfigurable logic on-the-fly as needed. However, with each change to the debug circuitry, recompilation is necessary, a process that incurs significant cost and overhead.

Finally, several works have been proposed on selecting the signals that one may wish to trace for debugging. In [7], [8], the authors develop algorithms to select a small set of signals such that their values have a higher chance of restoring a significant fraction of the untraced states. The work in [9] attempts to select signals that can refine the debugging resolution. While most works target ASIC designs, the work in [10] is designed specifically for FPGAs. It predicts which signals may be useful for debugging and automatically instruments the design to utilize spare resources on the FPGA. Any prior work on signal selection could be used in conjunction with our approach.

III. A RECONFIGURABILITY-DRIVEN APPROACH TO FPGA FUNCTIONAL DEBUG

This section presents a new approach to enhance the observability of FPGA designs for functional debug. Fig. 2 shows the conventional FPGA design process and the proposed reconfigurability-driven process in one debug session. To debug functional errors in an FPGA design, the design is first synthesized, placed and routed on the target FPGA device. The programming bitstream is generated, programmed into the FPGA, and execution commences. If unexpected behavior is observed, a set of internal signals is selected to be traced by a logic analyzer to provide more information.

Fig. 4. Multiplexer for signal selection.
In the conventional debug process (Fig. 2(a)), the design needs to be recompiled and the FPGA needs to be reprogrammed. Fig. 3 shows the area overhead of Altera's SignalTap [11] logic analyzer vs. the number of signals being tapped. One can see that the overhead grows significantly as the number of monitored signals increases. Due to the area overhead of the logic analyzer, usually only a small set of signals is traced at any one time. The process is repeated until the values for all signals of interest are acquired. The main issue with this process is that it can take hours to compile large designs [12]. As such, repeated compilation can introduce significant time overhead and prolong the overall debug process.

To alleviate the issue, a new design process that avoids recompilation is presented in this work, shown in Fig. 2(b). The idea is to modify the bitstream directly when different signals need to be traced. This is achieved by inserting a multiplexer in the FPGA whose inputs are all the signals that one potentially wants to trace. Fig. 4 depicts a multiplexer that can select one of m groups of w signals. The select signals of the multiplexer are preset to logic-0 or logic-1. Then, one can trace different signals by manipulating the bitstream to set the select signals to different constants. Since no re-routing is required, the bitstream modifications can be done easily. As a result, the time overhead of this process is reduced to a bitstream modification followed by a bitstream download. Downloading normally requires only seconds, which is significantly less overhead than the traditional re-compilation approach.

Another advantage of the proposed process is its negligible effect on the stability of the design. In the conventional debug process, the design is re-routed each time different signals are selected. As a result, designers often need to readjust the design to meet the various timing constraints.
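The selection mechanism described above can be sketched in Python as follows. This is a behavioral model only: the function names (`select_bits`, `mux_groups`, `debug_sessions`) and the signal names are hypothetical, and the step that would actually patch the select constants into a vendor bitstream is represented simply by computing the constants.

```python
import math


def select_bits(group_index, num_groups):
    """Constant values for the MUX select inputs that pick one group."""
    width = max(1, (num_groups - 1).bit_length())
    return [(group_index >> i) & 1 for i in range(width)]


def mux_groups(signals, w, select):
    """Model the signal-selection MUX: return the w signals of the chosen group."""
    index = sum(bit << i for i, bit in enumerate(select))
    return signals[index * w:(index + 1) * w]


def debug_sessions(signals, w):
    """One compilation, then one select-constant change per group of w signals."""
    num_groups = math.ceil(len(signals) / w)
    for group in range(num_groups):
        select = select_bits(group, num_groups)   # constants written via the bitstream
        yield select, mux_groups(signals, w, select)


# Hypothetical example: 16 candidate signals, 4 traced per silicon execution.
signals = [f"sig{i}" for i in range(16)]
sessions = list(debug_sessions(signals, 4))
```

In this model, the 16 candidate signals are covered in four sessions, and each session differs from the previous one only in the values of the two select constants, which is what makes recompilation unnecessary.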
Even though recent FPGA tools provide incremental compilation to preserve the engineering effort from previous place/route steps, experiments show that the timing of designs after incremental compilation can still vary. In the proposed process, because all signals one wants to trace are connected to the selection module at the beginning, only one compilation is necessary. As a net result, selecting different signals through bitstream modifications minimizes the overall impact on the performance of the design.

A. An Area-Optimized Multiplexer Implementation

It is well known that FPGAs are inefficient at implementing multiplexers. Therefore, in this section, a novel multiplexer implementation, optimized in the number of LUTs, is presented. The proposed construction also takes advantage of the bitstream changes (described above). Fig. 5 shows a traditional 16-to-1 MUX implementation in a Stratix III FPGA (the image is a screen capture from Altera's technology map viewer tool). Observe that five 6-input LUTs are required. In a traditional MUX, the values of signals on the MUX select inputs can change at any time while the circuit operates. However, in the proposed design process, the selected trace signals do not need to change as the circuit operates. Rather, the set of selected signals is determined by the FPGA bitstream, and as such, may only change between device configurations. This makes an alternative MUX implementation possible, one that consumes only three 6-input LUTs in the 16-to-1 case.

Fig. 5. Traditional 16-to-1 MUX implementation in 6-input LUTs.

Fig. 6. Proposed 16-to-1 MUX implementation in 6-input LUTs.

The new MUX design is based on recognizing that a LUT's internal hardware contains a MUX, coupled with SRAM configuration cells. In our design, the LUT's internal MUX forms a portion of the MUX we wish to implement (made possible owing to the MUX select lines being held constant during device operation). Fig. 6 shows the proposed 16-to-1 MUX, where the 16 inputs are labeled i0-i15. In this case, the LUT configuration SRAM cells (i.e., the truth table) determine which MUX input signal is passed to the output. For the purposes of illustration, in Fig. 6, each LUT is labeled with the logic function needed to select the 6th MUX input (i5) to the output. Only three LUTs are required. The LUT labeled f1 passes input i5 to its output. LUT f2 can implement any logic function since its output is not observable (however, to save power, f2 should be programmed to constant logic-0 or logic-1). LUT f3 is programmed to pass f1 to its output. The proposed design offers significant area savings relative to the traditional design, and allows signal selection via bitstream changes.

IV. ARCHITECTURE VARIANTS WITH RESOURCE CONSTRAINTS

Although the debugging scheme described in Section III uses area-optimized multiplexers, a particular design may not have the luxury of spare output pins or LUTs. To accommodate different resource constraints, two architecture variants are presented. The first variant reduces the number of output pins by storing data in shift registers; the second variant utilizes embedded memories to reduce the number of LUTs. Detailed descriptions of each variant are given in the next two subsections.

A. Debugging with Limited Output Pins

The debugging architecture described in Section III requires multiple output pins if a group of signals is traced in one silicon execution. This approach may not be feasible in cases where the output pins are limited. Therefore, an alternative architecture that utilizes a parallel-in serial-out shift register is presented in Fig. 7.

Fig. 7. Implemented with a 4-bit shift register.

In Fig. 7, only one output pin is used. Values of the target group are loaded into the shift register in parallel in each clock cycle. Then, the system clock is stopped, and a second debugging clock is used to shift out the stored value. There is a trade-off between the number of output pins and the test execution time. If more output pins are available, the data can be distributed into multiple shift registers which feed different output pins. This results in fewer clock cycles for retrieving data from the shift registers.

This architecture can be improved to obtain all values stored in the shift registers within one system clock cycle (without stopping the system clock).
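The shift-register scheme and its pin/time trade-off can be illustrated with a short behavioral sketch in Python. The class and function names are hypothetical, the shift direction (first loaded bit leaves first) is an assumption, and clocking is modeled simply as loop iterations rather than as real clock domains.

```python
import math


class PisoShiftRegister:
    """Parallel-in serial-out register feeding a single output pin."""

    def __init__(self, width):
        self.width = width
        self.bits = [0] * width

    def load(self, values):
        # Parallel load of the traced group on a system clock edge.
        if len(values) != self.width:
            raise ValueError("one value per traced signal is required")
        self.bits = list(values)

    def shift_out(self):
        # One debug clock edge: the head bit appears on the output pin.
        out = self.bits[0]
        self.bits = self.bits[1:] + [0]
        return out


def trace_on_one_pin(samples, width):
    """Return the bit stream observed on the pin for a sequence of samples."""
    register = PisoShiftRegister(width)
    pin_stream = []
    for sample in samples:            # one iteration per system clock cycle
        register.load(sample)
        for _ in range(width):        # 'width' debug clock cycles per sample
            pin_stream.append(register.shift_out())
    return pin_stream


def debug_cycles_per_sample(width, pins):
    """Debug clock cycles needed per sample when the bits are spread over 'pins' registers."""
    return math.ceil(width / pins)


stream = trace_on_one_pin([[1, 0, 1, 1], [0, 0, 1, 0]], width=4)
```

With a single pin, a 4-bit group costs four debug clock cycles per system clock cycle; `debug_cycles_per_sample(8, 2)` shows how adding a second pin halves the cycles needed for an 8-bit group.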
Instead of shifting the data with a debug clock supplied from off-chip, one can use the on-FPGA PLL to synthesize the debug clock from the system clock, with the debug clock being n times faster than the system clock, where n is the width of the shift registers. The advantage of this implementation is that the design does not need to be halted after each cycle in order to empty the shift registers. However, this approach is only feasible if the system can be operated at a low frequency.

B. Debugging with Limited LUTs

While using extra output pins may not be an issue, designers may want to save LUTs for the actual design. In this section, an alternative implementation that uses an embedded memory to replace the multiplexer tree is presented. The SRAM memory blocks in Altera's Stratix III FPGA support different read and write data widths [13]. With this feature, the multiplexer tree described in Section III can be eliminated to further reduce the LUT count.

In this implementation, the architecture consists of a memory controller and an m-by-n memory, where m is the total number of signals that one may want to trace and n determines the number of samples that can be stored. Instead of selecting the target signals through multiplexers, the values of all signals are written to the memory in one write operation. When acquiring data from the memory, each read operation reads only the segments storing the values of the target signals.

Fig. 8 shows an implementation for tracing four groups of four signals. The size of the memory is 32 by 16. During the write mode, a 16-bit word is written to the memory, as shown in Fig. 8(b). The write address bus width is 5 bits. After every 32 cycles, the content of the memory needs to be read out; otherwise, the old data would be overwritten. During the read mode, a 4-bit word is read each time. Hence, the read address bus width is 7 bits (Fig. 8(c)). Assume that group A is the one we are interested in.
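The mixed-width addressing of this example can be sketched as follows. The sketch assumes that group A occupies segment 0 of each written word (so the two low-order read-address bits are 00 for group A); that mapping, like the names `TraceMemory` and `read_address`, is an assumption for illustration rather than a description of the Stratix III memory block.

```python
WRITE_WIDTH = 16   # bits written per system clock cycle (four groups of four)
READ_WIDTH = 4     # bits returned per read (one group)
DEPTH = 32         # number of 16-bit words
SEGMENTS = WRITE_WIDTH // READ_WIDTH

WRITE_ADDR_BITS = (DEPTH - 1).bit_length()             # 5-bit write address
READ_ADDR_BITS = (DEPTH * SEGMENTS - 1).bit_length()   # 7-bit read address


def read_address(sample_index, segment):
    """Upper bits pick the stored word; the two low bits pick the 4-bit segment."""
    return sample_index * SEGMENTS + segment


class TraceMemory:
    """Behavioral model of a memory written 16 bits wide and read 4 bits wide."""

    def __init__(self):
        self.words = [[0] * WRITE_WIDTH for _ in range(DEPTH)]

    def write(self, address, word):
        if len(word) != WRITE_WIDTH:
            raise ValueError("a full 16-bit word must be written")
        self.words[address] = list(word)

    def read(self, address):
        sample_index, segment = divmod(address, SEGMENTS)
        start = segment * READ_WIDTH
        return self.words[sample_index][start:start + READ_WIDTH]


GROUP_A = 0  # assumed segment index for group A
addresses = [read_address(i, GROUP_A) for i in range(3)]
binary_addresses = [format(a, "b") for a in addresses]   # "0", "100", "1000"

memory = TraceMemory()
memory.write(1, [1, 0, 1, 0] + [0] * 12)
group_a_sample = memory.read(read_address(1, GROUP_A))
```

Because the segment bits stay fixed for the whole run when only one group is traced, they play the same role as the constant MUX select inputs of Section III.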
The read address sequence for retrieving the desired data is 0, 100, 1000, etc. (in binary). Observe that the last two bits of the read address control which segment is to be read. If only one group of signals will be traced in one silicon execution, these two bits are held constant the whole time. Similar to the multiplexer implementation, they can be set to a constant value and changed by altering the bitstream.

Fig. 8. Implemented with an embedded memory: (a) architecture; (b) memory write; (c) memory read.

There is one limitation with this implementation, namely, that Stratix III only allows the widths of the write and read data buses to differ by a ratio of up to 32. Hence, for some cases, such as 128 groups of 2, two memories are required. As a result, multiplexers are still needed to select the final data from one of the memories. Nevertheless, the size of these multiplexers remains much smaller than the original multiplexer implementation.

V. EXPERIMENTAL STUDY

This section presents the area overhead and timing impact of the proposed structures. The structures were integrated into benchmarks selected from the OpenCores and CHStone benchmark suites [14]. The CHStone benchmarks were synthesized from the C language to Verilog RTL using a high-level synthesis tool [15]. All RTL benchmarks were then compiled using Altera's Quartus II 11.0, targeting the 65 nm Stratix III FPGA, with a 1 GHz timing constraint. Table I summarizes the ALM and register utilization of each original benchmark (i.e., without any debugging structures integrated). The table also shows the post-routing maximum frequency (Fmax) of the benchmarks. Note that although our experimental study targets Altera FPGAs, the proposed debugging flow is not limited to Altera, and applies equally to FPGAs from other vendors.

TABLE I. BENCHMARKS.

Benchmark    ALM     Reg     Fmax (MHz)
ethernet     1323    1256    321.85
mem ctrl     12      151     266.95
tmu          2336    325     168.63
i2c          11      153     527.25
rsdecoder    658     539     73.6
main         283     26      37.7
dfsin        1396    16367   118.29
aes          822     99      129.1
adpcm        1133    9852    11.58
gsm          5816    5998    122.89
In our experimental setup, registers in each module of each benchmark were randomly selected as tracing candidates. Combinational signals were selected if there were not enough registers, such as in the i2c benchmark. Benchmarks were modified such that traced signals were wired to the top level of the benchmark and connected to the proposed structures. Altera's synthesis attributes, keep and noprune, were used to ensure that all signals exist after optimization. In the following discussion, the notation m-w represents the tracing setting where m signals are candidates for tracing and w signals are traced concurrently in one silicon execution. Experimental results of the three structures described in Section III and Section IV are presented in the next subsection, followed by an analysis of the productivity and the stability of the proposed design process.

A. Area usage and timing analysis

The first proposed structure is an m-to-w multiplexer. Fig. 9 depicts the area overhead and Fmax of multiplexers with various sizes. Three implementations are investigated: a traditional MUX implementation, a 6-LUT-based implementation (as proposed in Section III-A), and a 4-LUT-based implementation (the same as proposed in Section III-A except using 4-LUTs instead of 6-LUTs).

Fig. 9. Area and timing analysis of area-optimized multiplexers: (a) area (ALMs); (b) Fmax (MHz), for the traditional, 6-LUT and 4-LUT implementations across MUX configurations.

As shown in Fig. 9(a), the 6-input LUT implementation uses, on average, 3% fewer ALMs than the traditional MUX implementation. The 4-input LUT implementation can further reduce the usage of ALMs. This is because each ALM in a Stratix III device can contain two 4-input LUTs, and Quartus II may merge two 4-input LUTs into one ALM.
However, there is no user control to force such an optimization to happen, and therefore, in the remaining experiments, all multiplexers in the proposed structures are implemented with the 6-input LUT approach. Fig. 9(b) shows the Fmax of each MUX implemented in isolation. Since the area-optimized implementation requires fewer ALMs to construct a multiplexer, less parasitic capacitance is introduced on the critical path. Consequently, multiplexers with the 4-input LUT implementation have the highest frequency in most cases.

Table II reports the area usage and Fmax of the benchmarks when the area-optimized multiplexer is integrated. The first column lists the benchmark name. The next eight columns are the percentage increase in ALMs and registers for each benchmark in different tracing settings. The final eight columns are the percentage of Fmax change in those cases. The area overhead comes not only from the additional structure, but also from wiring signals from submodules up to the top-level module. Results show that in most cases the area overhead is less than 1%. Note that the area increase for i2c is the highest among all benchmarks. i2c is much smaller than the other benchmarks (only 29 ALMs and registers), making the area overhead relatively larger. Overall, Fmax is not greatly affected; changes are mainly due to algorithmic noise. The only exception is rsdecoder, the reason being that the critical path for this benchmark is altered to pass through the multiplexer.

The second structure utilizes a shift register to store data temporarily in order to reduce the number of output pins for data acquisition. In this experiment, only one output pin is used. A faster debug clock is generated from the system clock using the Stratix III PLL. The faster clock allows us to shift out the content of the shift register within one system clock cycle. The area and Fmax values are shown in Fig. 10.
Due to the shift register, the area cost is slightly greater than with the simple multiplexer implementation shown in Fig. 9(a). Furthermore, Fmax drops significantly in all cases, since the system clock speed is limited by the debug clock speed. Similar to Table II, the effect of the shift register-based structure on the performance of the benchmarks is summarized in Table III. As expected, because of the clock multiplier and additional shift registers, the overall area overhead is a bit higher than the area overhead of the simple multiplexer discussed previously. For three of the ten benchmarks, Fmax drops more than 5%.

The last proposed structure uses embedded memories to replace the large multiplexer. The area overhead and Fmax are shown in Figs. 11(a) and 11(b), respectively. The area overhead graph shows both the number of ALMs (the bar) and the memory utilization (the line).

TABLE II. EFFECTS OF AREA-OPTIMIZED MULTIPLEXER IN VARIOUS TRACING SETTINGS.

Area Increase Percentage (ALMs + registers) (%)

Benchmark    128-1   128-2   128-4   128-8   256-1   256-2   256-4   256-8
ethernet     6.6     6.91    7.1     7.3     1.95    1.87    11.26   11.53
mem ctrl     8.3     6.12    8.87    7.18    1.79    11.57   11.66   13.69
tmu          5.92    5.95    6.2     6.11    1.9     1.95    1.99    11.12
i2c          32.65   35.71   37.1    33.67   57.82   66.33   7.15    72.79
rsdecoder    12.3    11.36   1.52    9.86    17.     13.3    19.5    17.79
main         .61     .67     .75     .69     1.15    .96     1.11    1.15
dfsin        .5      .6      .52     .39     1.1     1.8     1.6     .96
aes          1.35    1.1     1.31    1.67    2.13    2.5     2.8     2.9
adpcm        1.1     1.61    1.52    1.66    1.8     1.76    1.59    1.6
gsm          2.29    2.18    1.93    1.63    3.37    3.82    3.56    3.7

Fmax Change Percentage (%)

Benchmark    128-1   128-2   128-4   128-8   256-1   256-2   256-4   256-8
ethernet     -.2     -.28    -.2     -.7     -.19    -.31    -.6     -.11
mem ctrl     -7.51   -3.2    -1.1    -5.2    -.97    -8.19   -12.15  -7.23
tmu          2.19    1.99    2.12    2.2     1.6     1.6     .98     .92
i2c          -2.13   -1.73   -2.3    -1.55   -.78    -3.18   -5.83   -2.31
rsdecoder    -3.37   -35.6   -32     -17.51  -1.16   -33.99  -29.87  -28.51
main         -2.8    -1.81   -1.36   -.6     -3.7    -.3     2.86    2.16
dfsin        2.33    3.53    -1.5    3.29    .       1.57    -.6     -3.51
aes          -.7     -.7     -.33    .77     -.98    .17     -.6     -1.1
adpcm        3.31    3.62    2.7     -.27    3.78    1.79    -.5     -.36
gsm          -2.1    -.2     -1.73   -1.77   -.33    -1.88   -5.39   -.58

Fig. 10. Area and timing analysis of area-optimized multiplexers with shift registers: (a) area (ALMs); (b) Fmax (MHz), across shift register configurations.

Fig. 11. Area and timing analysis of trace buffer: (a) area (ALMs and memory utilization in Kbits); (b) Fmax (MHz) of the write clock and the read clock, across trace buffer configurations.
Trace buffers are designed to store 1 samples. As mentioned earlier, the aspect ratio of memory write/read data bus widths of embedded memories in Stratix III is limited to a maximum of 32. Hence, for the last six trace settings in Fig. 11(a), two memory blocks are required and multiplexers are used to select the final data. Taking 128-2 as an example, the aspect ratio of write/read data width would be 64 if we wanted to use one memory only. Due to this limitation, two memories that write 64 bits and read 2 bits are instantiated instead. In addition, four 2-to-1 multiplexers are used to select from the outputs of the memories. Consequently, a small number of ALMs are still required. Fig. 11(b) depicts the Fmax of the write clock and the read clock. Finally, Table IV summarizes the area usage and Fmax of the benchmarks with various sizes of trace buffers. Compared with the data in the other two tables, one can see that this structure introduces the least area overhead.

B. Productivity and Stability

In the last set of experiments, we evaluate the productivity and stability of the conventional design process. Altera's SignalTap is used as the embedded logic analyzer. As mentioned in Section III, due to the size of SignalTap, acquiring trace data for a large number of signals is often achieved by successively tracing multiple smaller groups. Recompilation is required when a different group of signals is selected.

The experiment is carried out as follows. Two tracing settings are studied: 128-8 and 256-8. In order to use the incremental compilation feature in Quartus II, only post-fitting signals are considered. First, the design is compiled without the SignalTap module. Then, 128 (256) post-fitting nodes are randomly selected after the first compilation. Next, eight signals from the set are monitored. The procedure is repeated until all 128 (256) signals are traced. The compilation time results are summarized in Table V. The first column lists the benchmarks.
The next four columns report the results for the first tracing setting: the compilation time of the proposed process, the first compilation of the SignalTap process, the average compilation time of each debug session and the total cumulative compilation time of the SignalTap-based debugging process. The result of the second tracing setting is reported in the final four columns. As shown in the table, since the proposed bitstream-modifications-only process only requires one compilation, the compilation time roughly equals to the first compilation of the SignalTap process. Although incremental compilation reduces the compilation time by %-8%, each additional compilation adds time overhead. Overall, the proposed process can save up to 93% (i.e., 139/26 for ethernet) in the case of the 128-8 scenario, and 97% (i.e., 13/3233 for rsdecoder) in the case of 256-8. Incremental compilation tries to preserve the engineering effort from a previous compilation to minimize the impact to design performance. While it does well in many cases, experiments show that Fmax can still vary when the monitored signals are on the critical path. The result is plotted in Fig. 12(a). In each case, a total of 32 signals are traced. The x-axis of the plot is the number of traced signals that are on the critical path. The y-axis is the normalized Fmax, where the base is the Fmax of the original benchmark. One can see that Fmax drops in various degrees, as much as 1%. It all depends on what signals are monitored. For designs that can be operated at a very high frequency, the SignalTap module can in fact be where the critical path resides. In this case, monitoring any set of signals can change Fmax, as shown

TABLE III. EFFECTS OF AREA-OPTIMIZED MULTIPLEXERS WITH SHIFT REGISTERS IN VARIOUS TRACING SETTINGS.

           Area increase (ALMs + registers) (%)            Fmax change (%)
Benchmark  128-2   128-4   128-8   256-2   256-4   256-8   128-2   128-4   128-8   256-2   256-4   256-8
ethernet   6.81    7.3     7.16    9.71    1.53    11.2    -2.76   -.2     -.89    -51.1   -5.26   -53.32
mem ctrl   12.9    12.8    12.22   1.12    15.32   16.1    -29.25  -28.5   -28.67  -39.87  -35.37  -31.33
tmu        6.56    6.2     6.7     9.81    1.29    1.76    -5.2    -6.8    -6.3    -.3     -9.68   -7.1
i2c        32.31   39.12   37.1    73.7    6.97    72.79   -6.91   -62.31  -58.22  -63.78  -66.1   -61.86
rsdecoder  12.36   13.37   12.53   19.21   2.21    19.8    -75.55  -71.96  -69.2   -75     -76.25  -76.1
main       .21     .25     .25     .6      .6      .63     .77     .61     -.72    -2.86   -1.9    .3
dfsin      .29     .27     .39     .73     .76     .85     -1.7    -9.38   -8.57   -1.2    -7.75   -6.2
aes        1.9     1.2     1.13    2.1     2.2     2.17    -.93    -2.98   -1.9    -11.11  -1.22   -11.96
adpcm      1.22    1.2     1.9     1.7     1.77    1.88    -3.63   1.1     1.55    .1      3.36    2.63
gsm        2.26    2.9     2.28    3.7     3.79    3.82    -3.7    -3.11   -2.17   -7.6    -.63    -.58

TABLE IV. EFFECTS OF TRACE BUFFERS IN VARIOUS TRACING SETTINGS.

           Area increase (ALMs + registers) (%)            Fmax change (%)
Benchmark  128-2   128-4   128-8   256-2   256-4   256-8   128-2   128-4   128-8   256-2   256-4   256-8
ethernet   6.35    5.85    7.9     9.61    8.8     8.72    -.32    -.52    -.79    -2.63   -.16    -7.2
mem ctrl   9.78    9.63    11.22   12.     1.8     1.9     -9.71   -9.1    -9.39   -5.1    -6.51   -5.55
tmu        .23     .21     .86     9.78    1.93    1.78    -1.3    -.9     -.85    -1.9    -2.69   -.37
i2c        23.6    22.2    2.62    55.1    7.6     61.22   -5.72   -.1     -1.38   -17.36  -9.68   -.9
rsdecoder  12.37   11.5    11.61   18.11   2.      19.1    -36.3   -35.71  -26.89  -39.9   -39.53  -33.38
main       .18     .2      .19     .55     .58     .58     -.59    -1.28   -1.23   -.93    -1.28   -.16
dfsin      .28     .28     .37     .69     .71     .73     .87     -.71    -.3     -.2     -.27    -.21
aes        1.9     .99     1.      2.1     2.12    2.15    .1      -.6     -.9     -.2     .5      -.1
adpcm      .81     .9      .93     1.23    1.32    1.39    -1.55   -1.     -9.18   -1.69   -1.2    -.92
gsm        2.7     2.19    2.2     3.12    3.      3.31    -1.95   -1.72   -2.62   -1.9    -2.21   -2.8

TABLE V. COMPILATION TIME OF SIGNALTAP.

           128-8                           256-8
           Prop.   SignalTap (sec)         Prop.   SignalTap (sec)
Benchmark  (sec)   First   Incr.   Total   (sec)   First   Incr.   Total
ethernet   139     13      117     26      11      13      119     396
mem ctrl   15      13      12      2129    156     13      123     73
tmu        169     161     137     235     179     161     1       639
i2c        121     12      18      189     123     12      19      396
rsdecoder  16      13      99      1685    19      13      98      3233
main       19      18      293     611     153     18      29      1737
dfsin      76      696     216     15      711     696     217     768
aes        65      53      186     328     66      53      18      591
adpcm      63      615     226     23      639     615     225     7815
gsm        299     289     153     2738    31      289     153     517

Fig. 12. Stability of SignalTap. (a) Tracing nodes on the critical path (x-axis: critical path nodes; y-axis: normalized Fmax). (b) Tracing random nodes (y-axis: normalized Fmax). Legend entries: rsdecoder, i2c, ethernet, mem ctrl, tmu, main, dfsin, aes, adpcm, gsm.

in Fig. 12(b). The x-axis of the plot is the execution session, where 8 signals are traced in each session with 32 sessions in total. The plot shows that Fmax is unstable from one session to another.

VI. CONCLUSIONS AND FUTURE WORK

Functional debugging using FPGA devices provides several advantages over the traditional software simulation approach. This work presents a set of hardware structures that take advantage of FPGA reconfigurability to enhance observability for debugging. Furthermore, experimental results demonstrate that the new techniques can improve the productivity of the debugging process by up to 25×. One possible extension to this work is the integration of debug features, such as trigger events, into the proposed structures to enhance debugging capability. Another interesting extension is the development of a debugging algorithm that utilizes the proposed structures and provides an efficient and effective FPGA debugging environment.

REFERENCES

[1] I. Kuon and J. Rose, "Measuring the gap between FPGAs and ASICs," IEEE Trans. on CAD, vol. 26, no. 2, pp. 203–215, 2007.
[2] M. Gort and J. Anderson, "Deterministic multi-core parallel routing for FPGAs," in IEEE Int'l Conf. on FPL, 2010, pp. 78–86.
[3] L. Lagadec and D.
Picard, "Software-like debugging methodology for reconfigurable platforms," in Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, 2009, pp. 1.
[4] P. Graham, B. Nelson, and B. Hutchings, "Instrumenting bitstreams for debugging FPGA circuits," in IEEE FCCM, 2001, pp. 41–50.
[5] M. Abramovici, P. Bradley, K. Dwarakanath, P. Levin, G. Memmi, and D. Miller, "A reconfigurable design-for-debug infrastructure for SoCs," in ACM/IEEE DAC, 2006, pp. 7–12.
[6] B. R. Quinton and S. J. Wilton, "Programmable logic core based post-silicon debug for SoCs," in IEEE International Silicon Debug and Diagnosis Workshop, May 2007.
[7] H. F. Ko and N. Nicolici, "Algorithms for state restoration and trace-signal selection for data acquisition in silicon debug," IEEE Transactions on CAD, vol. 28, no. 2, pp. 285–297, Feb. 2009.
[8] X. Liu and Q. Xu, "Trace signal selection for visibility enhancement in post-silicon validation," in IEEE/ACM DATE, 2009, pp. 1338–1343.
[9] Y.-S. Yang, N. Nicolici, and A. Veneris, "Automating data analysis and acquisition setup in a silicon debug environment," IEEE Trans. on VLSI Systems, 2011.
[10] E. Hung and S. Wilton, "Speculative debug insertion for FPGAs," to appear in IEEE Int'l Conf. on FPL, 2011.
[11] Design Debugging Using the SignalTap II Logic Analyzer, Altera Corp., San Jose, CA, 2011.
[12] Increasing Productivity With Quartus II Incremental Compilation, Altera Corp., San Jose, CA, 2008.
[13] Stratix III Device Handbook, Altera Corp., San Jose, CA, 2011.
[14] Y. Hara, H. Tomiyama, S. Honda, and H. Takada, "Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis," Journal of Information Processing, vol. 17, pp. 242–254, 2009.
[15] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Czajkowski, "LegUp: high-level synthesis for FPGA-based processor/accelerator systems," in ACM/SIGDA Int'l Symp. on FPGAs, 2011, pp. 33–36.