FPGA Floor-Planning Impact on Implementation Results


Brigham Young University BYU ScholarsArchive All Theses and Dissertations 2012-11-14 FPGA Floor-Planning Impact on Implementation Results Jaren Tyler Lamprecht Brigham Young University - Provo


Part of the Electrical and Computer Engineering Commons

BYU ScholarsArchive Citation
Lamprecht, Jaren Tyler, "FPGA Floor-Planning Impact on Implementation Results" (2012). All Theses and Dissertations.

This Thesis is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in All Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more information, please contact scholarsarchive@byu.edu.

FPGA Floor-Planning Impact on Implementation Results

Jaren Lamprecht

A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science

Brad L. Hutchings, Chair
Brent E. Nelson
Michael J. Wirthlin

Department of Electrical and Computer Engineering
Brigham Young University
November 2012

Copyright 2012 Jaren Lamprecht
All Rights Reserved

ABSTRACT

FPGA Floor-Planning Impact on Implementation Results
Jaren Lamprecht
Department of Electrical and Computer Engineering, BYU
Master of Science

The field programmable gate array (FPGA) is an attractive computational platform for many applications because of its customizable nature and modest development cost, in terms of both time and money. As FPGAs scale to increased logical capacities, designers gain flexibility. However, the FPGA placement problem becomes more difficult at increased sizes. Increasingly, designers are encouraged to structure designs hierarchically and to floor-plan. Floor-planning is a manual process which maps specified design submodules to selected physical regions of the FPGA device fabric.

This thesis explores several of the effects that floor-planning has on submodules and the designs they comprise. A method is developed to explore the floor-planning impact on submodules independent of a full design. Six different submodules are independently subjected to varying timing constraints and to area constraints of varying aspect ratios and area allocations. The resulting submodule minimum clock periods, routing overflows, and relocatabilities are assembled from millions of submodule implementations. The aggregate results suggest that EDA placement and routing tools can meet design constraints even with extreme combinations of submodule aspect ratio and area allocation; however, the probability of implementations meeting constraints may be low at those extremes. Separate sets of submodule floor-planning guidelines are developed for meeting minimum clock period constraints, minimizing routing overflow, and maximizing relocatability. The submodule floor-planning guidelines for meeting minimum clock period are verified in full design implementations.

Keywords: FPGA, floor-plan, area constraint, clock constraint, routing spillover, partial reconfiguration, submodule relocation, Xilinx

ACKNOWLEDGMENTS

First and foremost, I would like to thank my wife, Elisabeth, for her patience and support while I completed this work. She spent many nights and Saturdays with newborn Bryton until this thesis reached completion. Her love and dedication provided motivation to finish. I am also grateful for my parents and their many years of love, support, and faith that gave me such a wonderful start in life. I thank my father for showing such a tremendous example of work ethic to provide for his family.

I would like to thank Dr. Brad Hutchings for introducing me to the BYU Configurable Computing Lab, which led me into the graduate program. I am grateful for his guidance and patience as I sought to carve out a topic for this thesis. Thanks to Dr. Brent Nelson for heading BYU's involvement in CHREC. CHREC has been a great opportunity for research and interaction with industry. I am grateful for the countless hours spent by each of the professors involved in extending our collaborative efforts. I would also like to thank Dr. Michael Wirthlin for his time and support. His dedication and preparation in providing challenging yet rewarding courses helped me to grow as a student. His digital system design course first introduced me to FPGAs.

Thanks also to the students of the CCL, whose examples and work helped me significantly as a graduate student. A special thanks to Chris Lavin, whose ideas drove much of the work I was a part of in the CCL, and who was a constant, patient source of information.

Special thanks also to the BYU Fulton Supercomputing Lab. Without their facilities, this work would not have been possible. Their machines provided nearly 1,000,000 processor hours of data for this work.

This research was supported by the I/UCRC Program of the National Science Foundation under Grant No through the NSF Center for High-Performance Reconfigurable Computing (CHREC).

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
  Motivation
  Preview of Approach
  Primary Contributions

Chapter 2 Background and Related Work
  Xilinx Virtex-5 FPGA Architecture
  Xilinx Description Language (XDL) and RapidSmith
  CLB Tiles
  DSP Tiles
  BRAM Tiles
  Interconnect Tiles
  Other Tiles
  Xilinx EDA Tool Flow
  XST
  NGDBuild
  MAP
  PAR
  BITGEN
  Randomization in EDA Tools
  Incremental FPGA Design and Floor-planning
  Xilinx Partitions
  Floor-planning
  Overview of this Work: Floor-planning Impact on Implementation
  Area Constraint Impact on Timing
  Area Constraint Impact on Routing Overflow
  Area Constraint Impact on Submodule Relocation

Chapter 3 Independent Submodule Implementation
  Motivation for Independent Submodule Implementation
  Independent Submodule Implementation
  Submodule Wrapping
  Hard Macro (HM) Barriers
  Wrapped Submodules in the Xilinx EDA Flow

Chapter 4 Area Constraint Impact on Submodule Timing
  Submodule Experiments Setup
  Representative Submodules
  4.1.2 FPGA Device Selection
  Establishing Baseline Submodule Clock Constraints
  Variation of Area Constraints
  Automatic Generation of Area Constraints
  Submodule Implementation with Various Area Constraints
  Area Constraint Impact on Submodule Minimum Clock Period
  SX Series Results
  LX Series Results
  Conclusions

Chapter 5 Floor-planning Impact on Design Timing
  Test Designs
  FP-Pipe Design
  Independent FP Submodule Implementations with.2 ns Constraint
  FP-Pipe Design Floor-plans
  FP-Pipe Floor-plan Implementation Results
  MIMO Design
  MIMO Design Floor-plans
  MIMO Floor-plan Implementation Results
  Conclusions

Chapter 6 Other Submodule Area Constraint Effects
  Submodule Routing Overflow
  Submodule Routing Overflow Results
  Handling Submodule Routing Overflow
  PR and Submodule Routing Overflow
  PR Submodule Routing Overflow Results
  PR Submodule Implementations without Routing Overflow
  Minimizing Submodule Routing Overflow
  Submodule Relocation
  Submodule Relocation Experiment Setup
  Submodule Relocation Results
  Maximizing Submodule Relocatability
  Submodule Implementation Runtimes

Chapter 7 The Future of FPGA Floor-planning
  Motivation
  Contributions
  Community Impact
  Future Work

REFERENCES

LIST OF TABLES

4.1 Representative Submodules
Submodule Baseline Clock Constraints
Test Designs Estimated Device Utilization
Test Designs Actual Device Utilization
FP-Pipe Implementation Results
MIMO Submodule Utilization Estimates
MIMO Implementation Results

LIST OF FIGURES

2.1 Virtex-5 Device Architecture
Virtex-5 Tiles and Resources
Virtex-5 Route Example
Conventional Xilinx EDA Flow
UCF Constraints
Wrapped Submodule
Hard Macro Barrier
Wrapped Submodule with Output-collector
FIR Submodule Clock Period Sweep
Area Constraint Generator
Composite Area Constraint
SX Series Submodule Clock Period Results
LX Series Submodule Clock Period Results
Independent FP Submodule Implementations with.2 ns Constraint
Full FP-Pipe Design with Various Floor-plans
Full MIMO Design with Constrained FFT Submodule
MIMO Floor-plans with 10% Area Overhead
MIMO Floor-plans with 2% Area Overhead
MIMO Floor-plans with 0% Area Overhead
Routing Overflow
SX Series Routing Overflow Results
LX Series Routing Overflow Results
SX Series PR Routing Overflow Results
LX Series PR Routing Overflow Results
SX Series PR Implementations without Routing Overflow
LX Series PR Implementations without Routing Overflow
Submodule Valid Placements

CHAPTER 1. INTRODUCTION

1.1 Motivation

Over the past two decades, the field programmable gate array (FPGA) has matured into a powerful computational platform [1]. Throughout the years, each advancement in silicon fabrication has allowed FPGA capacities to increase. As of 2012, FPGA capacities measure in the millions of logic gates and include a variety of dedicated cores such as memories, processors, and standard I/O cores. These computational resources allow FPGAs to replace application specific integrated circuits (ASICs) in many applications. While ASICs can achieve higher clock rates, lower area, and lower power than FPGAs [2], FPGAs have carved out some of the ASIC market share for three main reasons.

First, FPGAs, as the name implies, are field programmable, which enables them to be reprogrammed with a new design just as easily as the first design is programmed on the device. This allows new features to be introduced or old ones to be fixed by updating the device in the field, such as through a firmware update. This is a powerful feature which is not available with ASICs.

Second, FPGA designs have a faster time-to-market than ASICs and can be used for rapid prototyping. The FPGAs are already produced by the manufacturer and are waiting to be programmed with a design. The product can be ready as soon as the design is completed. In contrast, ASIC design generally requires at least 1-2 years of lead time to complete the design and then ramp up production [3,4]. Trouble with leading-edge processes can further delay ASIC production and reduce yields.

Lastly, and most importantly, FPGAs have significantly lower non-recurring engineering (NRE) costs compared to leading-edge ASICs. ASIC development, verification, and fabrication costs will total millions of dollars before the first chip is ever produced. In contrast, an FPGA design can be produced with an NRE cost of a few engineer salaries, and each FPGA device can be purchased from the manufacturer.

Traditionally, FPGA designers have been able to rely heavily upon electronic design automation (EDA) tools provided by FPGA vendors to automatically place logic elements on the FPGA. Although ASIC design placement has also been largely automated, it is not uncommon for at least some portion of an ASIC to be manually placed. Placement is known to be a computationally hard problem. The number of different placements available on a given FPGA is approximately the factorial of the number of logical elements available on the device [1]. EDA tools therefore use some form of heuristic search to find suitable design placements.

Until recently, EDA tools consistently provided quality FPGA design implementation results in time on the order of hours. However, as FPGA devices have grown to millions of gates, EDA tools have struggled to keep up on the most demanding designs. It is not unheard of for FPGA designs to now take days to implement after a design change [5]. The resulting implementation may not even meet the minimum clock period, in which case it must be re-implemented. Many long FPGA design implementations add up to increased NRE costs and a longer time-to-market.

In an effort to solve this problem, FPGA vendors have introduced floor-planning and incremental approaches to help EDA tools provide quality FPGA implementations in a timely manner. Floor-planning is a well-studied area of ASIC design, but there is little data available to FPGA designers on how to floor-plan a design with the goal of meeting a minimum clock period. The goal of this work is to gather a substantial amount of FPGA floor-planning data and use that data to provide guidance in floor-planning to meet timing. With that guidance, designers can increase the probability that implementations meet timing and cut down on the ever-increasing amount of time spent re-implementing demanding FPGA designs.

1.2 Preview of Approach

The ultimate goal of this work is to provide FPGA floor-planning guidance to designers.
Most often, a designer's primary concern when floor-planning an FPGA design is meeting minimum clock period timing constraints; therefore, the minimum clock period achieved by an implemented design becomes the primary metric for comparing different floor-plans. The randomized nature of EDA tools, however, makes minimum clock period comparisons difficult. Typical EDA tools rely heavily upon random numbers to find suitable design implementations; however, the random number generation is seeded with an integer so that implementations are deterministic. To obtain a different implementation of the same design, the EDA tools can be supplied with a different seed. Therefore, in this work, to compare floor-plans, each floor-plan is implemented many times with different seeds to determine what percentage of the floor-plan's implementations meet the minimum clock period constraints.

In this work, the identification of floor-planning guidelines is approached hierarchically. Because the floor-plan of a particular design consists of one or more design submodules assigned to specific areas, submodules can be examined independent of a full design to better understand how to assign the submodules to specific areas. This work develops a method for implementing design-independent submodules, examines a representative set of design-independent submodules assigned to varied areas, and determines guidelines for assigning areas to submodules. The guidelines determined for design-independent submodules are then verified by applying them to submodules in a full design floor-plan.

1.3 Primary Contributions

FPGA vendors provide some guidelines for developing floor-plans, but design data supporting these guidelines is scarce. A number of case studies exist where floor-planning has helped designers meet timing, but success in individual case studies cannot be generalized. This work is unique in that it examines floor-planning at an unprecedented depth and generates enough design data that generalities emerge and suggest floor-planning guidelines. In addition to the floor-planning impact on minimum clock period, this work also covers several other floor-planning effects unique to incremental partition design flows, partial reconfiguration design flows, and hard macro design flows. The main contributions of this thesis are listed below:

- Develops a method for implementing design submodules independent of the full design. (Chapter 3)
- Determines how floor-plan constraints impact design-independent submodule minimum clock period. (Chapter 4)
- Develops guidelines for selecting submodule floor-plan constraints. (Chapter 4)
- Verifies submodule floor-planning guidelines in a full design context. (Chapter 5)
- Determines how floor-plan constraints impact routing overflow in design-independent submodules. (Chapter 6)
- Determines how floor-plan constraints impact submodule relocation. (Chapter 6)
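The seed-based comparison described in Section 1.2 can be illustrated with a minimal Python sketch. It assumes that each implementation run has already been reduced to a seed and an achieved minimum clock period; the `Run` record, the `pass_rate` helper, and all numbers are hypothetical and are not taken from the thesis data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One implementation run of a floor-plan (hypothetical record)."""
    seed: int             # integer seed supplied to the EDA tools
    min_period_ns: float  # minimum clock period achieved by this run


def pass_rate(runs, constraint_ns):
    """Return the fraction of runs whose period meets the constraint."""
    if not runs:
        raise ValueError("need at least one implementation run")
    met = sum(1 for run in runs if run.min_period_ns <= constraint_ns)
    return met / len(runs)


# Made-up results for two floor-plans of the same design, four seeds each.
plan_a = [Run(1, 2.75), Run(2, 2.81), Run(3, 2.79), Run(4, 2.90)]
plan_b = [Run(1, 2.95), Run(2, 2.78), Run(3, 3.02), Run(4, 2.88)]

rate_a = pass_rate(plan_a, constraint_ns=2.8)
rate_b = pass_rate(plan_b, constraint_ns=2.8)
```

Under these made-up numbers, the first floor-plan meets the 2.8 ns constraint in two of four runs and the second in one of four, so the first would be preferred. Comparing pass rates rather than single runs is what keeps one lucky seed from deciding the comparison.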

CHAPTER 2. BACKGROUND AND RELATED WORK

This chapter describes basic FPGA architecture, FPGA EDA tool flow, incremental FPGA design and floor-planning, and related work in FPGA floor-plan analysis and guidance. This background should help the reader better understand the contributions found in later chapters.

2.1 Xilinx Virtex-5 FPGA Architecture

An FPGA is an integrated circuit fabricated to be configured by a designer after manufacturing. FPGAs are composed of reconfigurable logic and routing interconnect, providing a framework on which a desired circuit can be realized. The reconfigurable logic is capable of performing logical computations, memory storage, I/O, and other functions. Modern FPGAs also have many fixed functional units such as block memories, fast multiply-accumulators, multi-gigabit transceivers, PCI Express cores, or even CPU cores. The routing interconnect is a dense network of wires with programmable interconnections which allow connections to be made between logic elements [6]. The specifics of an FPGA's architecture vary among vendors and product families. This work targets the Xilinx Virtex-5 family for its studies on floor-planning [7].

2.1.1 Xilinx Description Language (XDL) and RapidSmith

Detailed architecture descriptions for each Xilinx FPGA device are available through the Xilinx description language (XDL) executable included with the Xilinx EDA tools [8]. The XDL executable runs in three modes: -ncd2xdl, -xdl2ncd, and -report. The first mode, -ncd2xdl, converts a Xilinx native circuit description (NCD) file to an XDL file. XDL files are self-documenting textual netlists equivalent to the NCD and describe the circuit in the mapped, placed, or routed states. The second mode, -xdl2ncd, is the inverse of the first, converting XDL to NCD. The final mode, -report, can be used to generate an XDLRC report of the physical resources available on a Xilinx device.
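The three modes can be summarized in a short Python sketch that only assembles argument lists and never runs the tool. The mode flags come from the text above; the operand order, the file names, and the `xdl_argv` helper are assumptions made for illustration, and the part name is only an example.

```python
XDL_MODES = ("-ncd2xdl", "-xdl2ncd", "-report")


def xdl_argv(mode, *operands):
    """Build the argument list for one xdl run without executing it."""
    if mode not in XDL_MODES:
        raise ValueError(f"unknown xdl mode: {mode}")
    if not operands:
        raise ValueError("xdl needs at least one file or part operand")
    return ["xdl", mode, *operands]


# Hypothetical file names; the input-then-output operand order is assumed.
to_text = xdl_argv("-ncd2xdl", "design.ncd", "design.xdl")
to_binary = xdl_argv("-xdl2ncd", "design.xdl", "design.ncd")
device_report = xdl_argv("-report", "xc5vsx240tff1738")
```

A list like `to_text` could be handed to `subprocess.run` on a machine with the Xilinx tools installed. Keeping command construction separate from execution makes the mode check easy to test.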

While XDL is useful for providing readable text representations of designs and devices, its utility is greatly extended by the open source RapidSmith XDL project [9]. RapidSmith provides a framework for easily extracting detailed information from the XDLRC device descriptions and manipulating Xilinx devices and designs.

The XDLRC report abstracts the FPGA into a two-dimensional array of tiles which are laid out edge to edge to form a rectangular grid. All of the configurable FPGA resources exist on this tile grid. Figure 2.1 shows a tile-level view of the Virtex-5 xc5vsx240tff1738 device. This device is typical of Xilinx devices and has a column-style architecture. At a high level, there are regions of columns dedicated to I/O. The central column also includes resources specific to clocking. The remaining portions of the device are mainly dedicated to user logic and interconnect.

The column architecture extends down to the tile level. Figure 2.2 shows a zoomed-in portion of the logic region of the tile grid from the xc5vsx240t device. The labels in Figure 2.2 indicate some of the main Virtex-5 resources: CLBs, DSPs, BRAMs, and Interconnect. Each column of tiles has only one of these four resource types, so a column may be referred to as a DSP column, a BRAM column, a CLB column, etc.

2.1.2 CLB Tiles

Each configurable logic block (CLB) occupies a single tile on the tile grid. A CLB in the Virtex-5 family contains two entities called slices. Each slice contains 4 look-up tables (LUTs), 4 flip-flops (FFs), and a few other gates useful for specific logic functions. The LUT is the basic unit of programmable logic. Virtex-5 LUTs have 6 available inputs and can implement any 6-input logic function using 64 bits of configuration memory. Some LUTs have additional functionality that allows the LUT to act as a 64-bit memory. These configurable-as-memory LUTs are found in slices known as SLICEMs, while the basic LUTs are found in slices known as SLICELs.
In the Virtex-5, a CLB may contain 1 SLICEL and 1 SLICEM, or it may contain 2 SLICELs.

2.1.3 DSP Tiles

In a DSP column, each DSP functional block effectively occupies 2.5 tiles (2 DSPs occupy a total of 5 tiles). The DSP is a versatile block capable of performing high-speed arithmetic-intensive operations such as those required by many DSP algorithms. It is most often used for multiply or multiply-accumulate functions. A single block can perform an 18x18 bit multiply, or multiple DSP blocks can be used to perform operations on larger operands. The number of DSPs available on a device varies with the size of the device as well as the series. The Virtex-5 SX series trades CLB columns for additional DSP columns and is targeted at designs that could benefit from a large number of DSPs.

Figure 2.1: A view from the RapidSmith device browser [9] depicting the tiles on the xc5vsx240t device. High level tile regions are indicated.

Figure 2.2: A view from the RapidSmith device browser [9] depicting a section of tiles on the xc5vsx240t device. Several device resource types are indicated.

2.1.4 BRAM Tiles

The block random access memory (BRAM) is available in BRAM columns, with each BRAM occupying 5 tiles. Each BRAM is a 36 Kbit dual-port RAM capable of configurations ranging from a 32K-entry 1-bit RAM to a 1K-entry 36-bit RAM. BRAMs may also be cascaded to produce larger memories. Like DSPs, the number of BRAMs available on a device varies with both the size of the device and the series.

2.1.5 Interconnect Tiles

The wiring segments that are available on the FPGA are arranged in north-south and east-west tracks that conceptually overlay the tile grid. The wire segments can be connected together by programmable interconnect points (PIPs). Each interconnect tile contains hundreds of PIPs for connecting different wires. Some wires span Manhattan distances as short as 2 interconnect tiles, while others span Manhattan distances of up to 18 interconnect tiles. Intermediate Manhattan distances are also available. Each interconnect tile also has connections to the single nearest resource tile to its east on the tile grid. Therefore, a signal leaving on a wire from a particular resource tile makes its first wire connection in the interconnect tile to the west of that resource tile, proceeds to make further wire connections in interconnect tiles toward its destination resource, and makes its final wire connection in the interconnect tile that is immediately west of the destination resource.

For an example of a route composed of interconnected wires, see Figure 2.3. Here, the source of the signal is the SLICE in the CLB tile located near the bottom center of Figure 2.3. The signal exits the CLB tile on a wire which goes west into the interconnect tile associated with that CLB. There, the wire connects to another wire which runs north 6 interconnect tiles. The northbound wire then connects to another wire which runs west 2 interconnect tiles.
The westbound wire then connects to another wire which travels north 2 interconnect tiles, where it arrives at the interconnect tile associated with the destination DSP.

2.1.6 Other Tiles

There are many other tile types in the Virtex-5 architecture, but the aforementioned ones are of primary concern in this work. Other tile types include those associated with I/O, clock circuitry, FPGA configuration, and null space reserved for hard PowerPC or PCI Express cores.
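The route example from the interconnect discussion (north 6, west 2, north 2) can be traced with a minimal Python sketch. The coordinate convention is an assumption made for illustration: positions count interconnect tiles, x grows to the east, y grows to the north, and the source interconnect tile sits at (0, 0). Neither the helpers nor the coordinates correspond to RapidSmith or XDL names.

```python
# Unit steps on an assumed grid: x grows east, y grows north.
STEPS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def trace_route(start, hops):
    """Return every interconnect tile where the route changes wires."""
    x, y = start
    visited = [(x, y)]
    for direction, length in hops:
        dx, dy = STEPS[direction]
        x, y = x + dx * length, y + dy * length
        visited.append((x, y))
    return visited


def manhattan(a, b):
    """Manhattan distance between two tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# The route described in the text: north 6, west 2, then north 2.
route = trace_route((0, 0), [("N", 6), ("W", 2), ("N", 2)])
wire_length = sum(length for _, length in [("N", 6), ("W", 2), ("N", 2)])
distance = manhattan(route[0], route[-1])
```

In this example the total wire length (10 interconnect tiles) equals the Manhattan distance between the source and destination interconnect tiles, meaning the route takes no detours. A route that doubled back would show a wire length greater than the distance.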


Figure 2.3: A view from the RapidSmith device browser [9] depicting a section of tiles on the xc5vsx240t device, showing a route composed of a set of wire interconnections from a SLICE to a DSP.

2.2 Xilinx EDA Tool Flow

The major FPGA vendors provide proprietary EDA tools for compiling hardware description language (HDL) designs to target their FPGAs. Figure 2.4 shows the conventional Xilinx FPGA EDA flow. User inputs are shown as white files. Xilinx executables are shown as black boxes. Intermediate Xilinx design files are shown in gray. Xilinx uses the common term synthesis to describe the first step in the flow. Xilinx collectively terms the steps that convert the logical synthesis results to a physical circuit description implementation. The implementation results can be converted to a bitstream which configures an FPGA device.
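The flow of Figure 2.4 can be viewed as a chain of file transformations, which the following Python sketch models. The tool names and file extensions follow the figure; the `Stage` record and the helper functions are hypothetical, and nothing here invokes the Xilinx executables.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    """One executable in the flow with its main input and output formats."""
    tool: str
    consumes: str
    produces: str


# Main data path of the conventional flow. NGDBuild also reads the UCF.
FLOW = [
    Stage("XST", "hdl", "ngc"),       # synthesis
    Stage("NGDBuild", "ngc", "ngd"),  # implementation starts here
    Stage("MAP", "ngd", "ncd"),
    Stage("PAR", "ncd", "ncd"),       # routed NCD replaces placed NCD
    Stage("BITGEN", "ncd", "bit"),
]


def chain_is_consistent(flow):
    """Check that each stage consumes what the previous stage produces."""
    return all(a.produces == b.consumes for a, b in zip(flow, flow[1:]))


def describe(flow):
    """Render the flow as a single arrow-separated line."""
    parts = [flow[0].consumes]
    for stage in flow:
        parts += [stage.tool, stage.produces]
    return " -> ".join(parts)


summary = describe(FLOW)
```

The consistency check is a cheap guard when stages are reordered or skipped in a script, for example when a previously routed NCD is fed straight to BITGEN.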

[Figure 2.4 flow: HDL (.VHD, .V) -> XST -> .NGC -> NGDBuild (with .UCF) -> .NGD -> MAP -> .NCD -> PAR -> .NCD -> BITGEN -> .BIT -> FPGA. XST is labeled Synthesis; NGDBuild, MAP, and PAR are labeled Implementation.]

Figure 2.4: The conventional Xilinx EDA flow converts user-input HDL and design constraints to a bitstream to program an FPGA. User input is indicated in white. Xilinx executables are indicated in black. Intermediate proprietary Xilinx netlist formats are indicated in gray.

2.2.1 XST

Typically, a circuit which will be implemented on an FPGA is initially described by the designer in an HDL such as VHDL or Verilog. When a design has been sufficiently described in the HDL, it can then be synthesized into a netlist form. A netlist is a collection of basic elements of logic (BELs) along with a list of nets, or connections, between them. The Xilinx synthesis tool is XST, and its proprietary netlist format carries the extension NGC.

2.2.2 NGDBuild

After synthesis, the netlist of BELs is technology-mapped to a specific FPGA architecture. Xilinx technology-mapping begins in NGDBuild. NGDBuild converts the netlist of BELs to a netlist of lower-level Xilinx primitives such as AND gates, OR gates, LUTs, FFs, RAMs, etc. This conversion involves some packing of the BELs and reduction in the number of nets. NGDBuild also adds user constraints to the design. User constraints are input to NGDBuild in a user constraint file (UCF). Examples of constraints are the locations of I/O pads, minimum clock periods, and area constraints for portions of the design. NGDBuild outputs a native generic database (NGD) file which describes the design as a netlist of lower-level Xilinx primitives.

2.2.3 MAP

Xilinx technology-mapping is completed by MAP. The lower-level Xilinx primitives are mapped and packed into LUTs, FFs, RAMs, and other resources available on the targeted FPGA.

When the technology-mapping and packing portion of MAP is completed, the design exists as a netlist of higher-level Xilinx primitives such as SLICEs, DSPs, BRAMs, and IOBs. In Virtex-5 and later families, MAP then performs a timing-driven placement of the design. The placer assigns each logical SLICE, DSP, BRAM, or other high-level primitive to a physical primitive site on the device while respecting any design location or area constraints and optimizing the placement to meet any timing constraints. The result of MAP is a native circuit description (NCD) file, a fully technology-mapped and placed netlist describing the design.

2.2.4 PAR

After placement, the nets of the design are routed. Routing assigns physical wires to all of the nets of the design while optimizing the wires chosen to meet minimum clock period or other timing constraints. The Xilinx router is PAR, and it effectively annotates the nets of an input NCD file with routing information and outputs a routed NCD file. The completely placed and routed design is the result of the collective implementation process, and it is referred to as a design implementation.

2.2.5 BITGEN

At this point, the design is completely described as a circuit that can exist on the FPGA. If all the user constraints are met by the tools up to this point, the design will function on the FPGA as the user described it. To be programmed on the FPGA device, the NCD description of the circuit must be translated into configuration bits which configure the physical resources of the FPGA. The Xilinx tool that performs this translation is BITGEN. BITGEN takes as input the placed and routed NCD file and outputs a BIT file which can be used to program the FPGA device. When the BIT file is loaded onto the device, the circuit becomes operational on the FPGA.

2.2.6 Randomization in EDA Tools

Certain steps in the FPGA EDA tool flow cannot be realistically implemented with exact algorithms.
The placement problem is particularly difficult because the number of valid design placements increases exponentially with the number of available device resources. The placer seeks to find a valid placement which should satisfy all of the user timing constraints after it is routed; however, an exhaustive search of all the valid placements is prohibitively expensive. As the number of resources on new FPGA devices continues to increase, the placement problem becomes increasingly large, and the portion of the placements which can be searched in a reasonable amount of time decreases. This can lead to many iterations of placement and routing to arrive at a suitable design implementation, resulting in long design cycles. Industry and academia are well aware of this problem and are actively researching and suggesting hierarchical incremental implementation as one possible solution [10-15].

2.3 Incremental FPGA Design and Floor-planning

As FPGA designs grow in size and complexity, a natural way to manage the complexity is to introduce hierarchy and break the design up into smaller submodules. This has always been standard practice at the HDL design level, but the push to implement an FPGA design as separate submodules has only developed within recent years. This type of hierarchical design and implementation can be an incremental process from start to finish [21].

Incremental FPGA implementation can drastically reduce the burden on the placer. When the placer only needs to place a fraction of the total design, it can typically find a suitable placement in a proportional amount of time [12]. With adequate planning and a bit of luck, if each of the design submodules has a placement that meets timing, then the composite design will meet timing. Later, if a change needs to be made to the design, the impact of the implementation change is often limited to a single submodule. The drawback to a hierarchical approach is the loss of global design optimizations across the hierarchical boundaries, which can lead to increased resource utilization.
With well-designed boundaries, the increased utilization is manageable, and an incremental approach can reduce the time spent implementing designs.

2.3.1 Xilinx Partitions

FPGA vendors have begun to provide hierarchical incremental implementation options in their proprietary FPGA compilation flows. Xilinx provides partitions as a hierarchical incremental implementation option in their conventional FPGA compilation flow [16]. In this partition flow, the hierarchical submodules can be marked as partitions. Previously implemented partitions can be imported on a subsequent implementation run.

2.3.2 Floor-planning

In order to combine implemented submodules of a design into a composite whole design, there cannot be any physical resource conflicts among the submodules. To prevent physical resource conflicts, a design must be floor-planned. A floor-plan is essentially a map created by the designer that assigns submodules to specific physical regions on the FPGA. Each physical region must have an adequate number of resources available for the submodule assigned to it, and the arrangement of physical regions on the chip should reflect submodule interconnectivity and timing constraints.

A floor-plan places additional constraints on the placer, leaving it less freedom to place the elements of the design. This can be helpful if the floor-plan provides an effective coarse-grain arrangement of design submodules. The floor-plan provides early guidance to the placer and prevents exploration of many inferior coarse arrangements. A poor floor-plan, however, can effectively eliminate any chance of finding an implementation that meets design constraints because the placer is not free to explore superior alternatives.

Xilinx Area Constraints

Floor-planning Xilinx FPGAs is performed by adding physical area constraints to a design, often in the UCF file [17]. Constraints are applied in a UCF as shown in Figure 2.5. Hierarchical logical divisions of a design netlist can be constrained to reside within a particular region of the device. This region is defined by an AREA_GROUP specifying physical ranges of SLICEs, DSPs, and BRAMs, each of which exists in a separate namespace grid. The separate namespace grids have unrelated coordinate systems, so selecting them requires a view that can relate them. Xilinx provides the PlanAhead graphical user interface to facilitate user generation of area constraints [18].
PlanAhead provides a device layout and allows the user to select a region of resources for the area constraint. This selection simply adds the AREA_GROUP ranges to the UCF file.
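The AREA_GROUP syntax described above is regular enough to be emitted by a script. The following is a minimal sketch, not the tooling used in this work: the function name, the instance pattern, the group name, and every coordinate are hypothetical and chosen only for illustration. It follows the constraint syntax shown in the sample constraints figure of this chapter.

```python
def area_group_ucf(instance_pattern, group_name, ranges):
    """Return UCF lines that bind an instance pattern to an AREA_GROUP.

    ranges is a list of (resource_prefix, ((x0, y0), (x1, y1))) entries,
    one per resource namespace (for example SLICE, DSP48, RAMB36).
    """
    lines = ['INST "{}" AREA_GROUP = "{}";'.format(instance_pattern, group_name)]
    for prefix, (lower_left, upper_right) in ranges:
        lines.append(
            'AREA_GROUP "{}" RANGE={}_X{}Y{}:{}_X{}Y{};'.format(
                group_name,
                prefix, lower_left[0], lower_left[1],
                prefix, upper_right[0], upper_right[1],
            )
        )
    return lines


# Hypothetical submodule and coordinates, for illustration only.
example = area_group_ucf(
    "my_submodule/*",
    "my_group",
    [
        ("SLICE", ((0, 0), (11, 39))),
        ("DSP48", ((0, 0), (0, 15))),
        ("RAMB36", ((0, 0), (0, 7))),
    ],
)
print("\n".join(example))
```

Each resource namespace gets its own RANGE line because, as noted above, the SLICE, DSP, and BRAM grids use unrelated coordinate systems.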

#Clock Period Constraint
NET clk TNM_NET = clk1 ;
TIMESPEC TS_clk1 = PERIOD clk1 2.8 ns HIGH 50 %;
#Area Constraints
INST "fft1024_x0/*" AREA_GROUP = "fft_group";
AREA_GROUP "fft_group" RANGE=SLICE_X24Y81:SLICE_X3Y160;
AREA_GROUP "fft_group" RANGE=DSP48_X2Y34:DSP48_X3Y6;
AREA_GROUP "fft_group" RANGE=RAMB36_X2Y17:RAMB36_X3Y32;

Figure 2.5: Sample constraints applied to a design. The clock period is constrained to 2.8 ns. The fft1024_x0 component is constrained to an AREA_GROUP composed of separate ranges for SLICEs, DSPs, and BRAMs.

Another method of relating the separate resource namespace grids is to use the XDL tile grid. Each resource exists in a tile on the tile grid, so the tile grid can provide an abstraction for distance between resources from different namespaces. For example, on the xc5vsx240t device, SLICE_X0Y0 and DSP48_X0Y0 reside at tiles (6,263) and (24,263), respectively. The tile grid also allows a selected region of tiles to be easily translated into the three namespaces required for AREA_GROUP constraints.

The tile grid is used heavily throughout this work for choosing area constraints for two primary reasons. First, the tile grid is a common grid covering the entire device, so manipulations and descriptions are simplified. Second, the RapidSmith framework provides an excellent starting point for automating the area constraint selection process.

2.4 Overview of this Work: Floor-planning Impact on Implementation

The subsequent chapters of this work investigate in depth some of the varied effects that area constraints can have on a submodule's implementation.

Area Constraint Impact on Timing

The main contributor to overall time spent implementing designs is often the set of timing constraints. If an implementation does not meet the timing constraints, the design must be implemented again. One of the greatest concerns when choosing a floor-plan should therefore be how the floor-plan will impact the probability that an implementation will meet timing.

Xilinx provides some limited guidance to help designers choose submodules and assign them to physical areas on the fabric [16, 18, 19]. The suggestions include registering the logic on the submodule boundaries, using rectangular area constraints that are near square, and providing at least 2% extra resources within the area constraint boundaries. However, Xilinx does not supply any design data to support the suggestions in its standard documentation. A separate Xilinx whitepaper offers five data points suggesting that square area constraints provide the highest realizable clock rates [20], but those five points were collected for the relatively old Virtex 2 architecture and only on a single submodule. Other case studies exist in which designers have demonstrated that floor-planning can aid in achieving timing closure [21], but these success stories mostly suggest that floor-planning is beneficial and provide little general direction.

Rather than examine a single design with a few floor-plan modifications, this work first takes a step back and investigates how area constraints impact individual submodules in Chapter 4. Chapter 3 first discusses how a submodule can be implemented independent of a full design. Chapter 5 then returns to a full design to determine how a floor-plan impacts timing in a full design.

Area Constraint Impact on Routing Overflow

A Xilinx floor-plan only restricts the placement of a design on the FPGA fabric. The router does not respect any area constraints, so routes may overflow outside of the area constraint, using interconnect tiles that were not enclosed within the AREA_GROUP boundaries. This routing overflow can cause routing resource conflicts in interconnect tiles even when the area constraints prevented conflicts in other tiles.
In timing-driven routing (PAR), these conflicts are uncontrollable unless the designer constrains each individual submodule route with directed routing constraints [14, 17]. While routing overflow cannot be controlled in timing-driven routing, it can be better understood so that the probability of conflict may be reduced. The floor-planning impact on routing overflow is discussed in Chapter 6.

2.4.3 Area Constraint Impact on Submodule Relocation

The author helped develop HMFlow [22], an alternative hierarchical incremental design flow which accelerates implementation for Xilinx devices using cacheable, relocatable submodules known as hard macros. Previously implemented design submodules are stored as hard macros which are imported during subsequent implementation runs. In HMFlow, a floor-plan is not determined before implementation because the relocatable hard macros are placed by a coarse-grain placer. However, resource conflicts must still be avoided in the placement, so each submodule must still have area constraints. The area constraint applied to a submodule during its initial implementation impacts the number of legal relocations available for it on the FPGA fabric. Chapter 6 provides more detail on this topic.

CHAPTER 3. INDEPENDENT SUBMODULE IMPLEMENTATION

This chapter describes a method for implementing a design submodule independent of the full design using the conventional Xilinx design flow. If desired, the implementation results of this method can also be automatically transformed into a Xilinx hard macro.

3.1 Motivation for Independent Submodule Implementation

In the conventional Xilinx design flow, at the implementation stage, a design that may have been hierarchical at earlier stages is flattened to a single-level netlist of physical resources which must be placed and routed on the FPGA. The placement and routing of each resource affects the placement and routing of all other resources. Constraints that may have applied to one submodule indirectly affect another submodule's implementation. In order to understand how area constraints impact the submodule, the interdependency with the rest of the design must be removed. The submodule must be implemented independent of the full design context.

For individual submodule implementation, Xilinx offers partitions; however, partitions do not completely remove a submodule from a design context [16]. A submodule that is labeled as a partition is still implemented in a full design context. At a minimum, that design would include IOBs that connect to the submodule partition. Since partitions cannot adequately remove a submodule from a design context, a novel way of implementing modules independently is developed in this work.

3.2 Independent Submodule Implementation

The method used in this work to independently implement submodules was originally developed to automatically generate hard macros for HMFlow [22]. It is largely the result of trial-and-error experimentation with the conventional Xilinx tools over hundreds of different submodules. The following subsections elaborate the submodule synthesis and implementation process.

3.2.1 Submodule Wrapping

It is the nature of the Xilinx synthesis and technology-mapping tools to iteratively remove any unused circuit outputs or logic. A design consisting of only a submodule without any connected outputs would therefore be optimized out of existence. A full design would have top-level IOBs which would retain the useful logic, but the location of IOBs can affect the placement of the submodule. Also, relying upon IOBs to prevent submodule annihilation limits the number of ports on a submodule to the number of user logic IOBs available on a particular device. Therefore, an alternative method is implemented for independent submodule implementation.

To prevent the zealous optimization process, the submodule can be wrapped with black box barriers to the optimization process. These barriers take the form of Xilinx hard macros (HMs), which use the file extension NMC and are essentially equivalent to NCDs. A Xilinx HM is a circuit in a post-technology-mapped state. An HM may also be in a placed or routed state. The HMs used as black box optimization barriers for submodule implementation are not placed. These barriers wrap the submodule as shown in Figure 3.1.

[Figure 3.1 diagram: a SUBMODULE with ports A ([3:0]), B (1 bit), C (1 bit), and D ([2:0]), each passing through its own HM barrier inside a wrapper that serves as the top module for synthesis.]

Figure 3.1: An example submodule wrapped with hard macro barriers.

3.2.2 Hard Macro (HM) Barriers

Xilinx HMs can be inserted into an HDL design like any other pre-compiled netlist. In VHDL, the HM barriers are declared as components and instantiated to wrap the submodule. The HM NMCs are imported during NGDBuild as any other NGC or EDIF netlists would be.

Each HM barrier is a collection of unplaced SLICEs with one SLICE per bit of the signal passing through it, as shown in Figure 3.2. Each SLICE passes its single bit of the signal through a single LUT and back out of the SLICE. The remainder of the LUTs are also arbitrarily configured so that they are not an available source for static GND or VCC during routing. While a single SLICE could act as a barrier for up to four bits of submodule I/O, it can be advantageous to allow only one bit per SLICE.

[Figure 3.2 diagram: the Port A HM barrier, in which bits 0 through 3 of the [3:0] signal each pass through their own SLICE (Port A SLICE 0 to Port A SLICE 3), with a detail view of Port A SLICE 2 showing its A, B, C, and D LUTs and flip-flops.]

Figure 3.2: An example of a 4-bit hard macro barrier. SLICE detail is shown with solid lines indicating the SLICE configuration.

As noted earlier, the barriers were originally developed for HMFlow, where the name of the barrier is used to tag an individual bit of a signal through the implementation process so that it may be identified post-implementation. This feature is key to automatically generating hard macros of submodules.

Wrapped Submodules in the Xilinx EDA Flow

A wrapped submodule can be synthesized and implemented through the conventional Xilinx EDA flow with a few specific command line options. This work utilizes the Xilinx v13.1 command line tools [23].

XST

A submodule wrapped in HM barriers is synthesized in XST like any normal submodule. All XST options are left at their defaults except the -iobuf NO option, which indicates that IOBs should not be inserted into the design. XST identifies the barriers as black boxes which will be resolved later. The submodule is thus synthesized without any context.

NGDBuild

NGDBuild requires as input the wrapped submodule NGC, any other NGCs required by the submodule, and the HM barrier NMC(s). Optionally, a UCF file may be specified with timing and/or area constraints.

If a clock constraint is specified, the HM barrier does not affect the clock net that the constraint applies to because the global clock buffer is always inserted between the barrier and the submodule. Any timing calculations are performed relative to the clock as it leaves the global clock buffer.

If an area constraint is included, it should specify that it applies only to the submodule. This specification ensures that the submodule does not compete with the barriers for resources during placement. Without any constraints applying to the barriers, their placement will naturally surround the submodule just as they logically surround the submodule in Figure 3.1.

The NGDBuild process identifies the HMs that match the barrier black boxes, but unlike with NGCs or EDIFs, the HMs do not replace the black boxes at this time. The HM black boxes effectively remain until the submodule has been completely technology-mapped.

MAP

MAP first completes technology-mapping for the submodule and then fills the HM black boxes with the technology-mapped HM definitions. Before proceeding to placement, MAP optimizes the circuit by removing unused logic. Unused logic is defined as logic that is undriven, logic that does not drive other logic, or logic that forms a cycle and affects no device output [23].

When MAP is run for a wrapped submodule, all options are left at their defaults except for the -u option, which is enabled to indicate that MAP should not remove unused logic. Removing unused logic would annihilate the wrapped submodule circuit because none of the logic affects the device output (there are no IOBs). The -u option applies a NOCLIP property to all incomplete nets in the design, preventing trimming from beginning at any of those points and cascading through the design. However, any dangling component that is not referenced by an incomplete net is still subject to trimming. A submodule that is implemented without HM barriers will not have any dangling incomplete output nets to append NOCLIP properties to, so such a submodule is optimized out of existence.

The HM barriers include incomplete nets in their definition, so the requisite incomplete nets are added to the design when the HMs are imported. With the HM barriers, before MAP attempts to remove unused logic, the design state is similar to Figure 3.1, where the dashed lines represent the incomplete nets. Therefore, when the -u option is invoked, MAP does not remove any logic from the wrapped submodule design.

One fatal error that MAP may encounter in a wrapped submodule design occurs when the submodule has one or more unused input ports.
When the submodule has an unused input, there is no net connecting the HM barrier that wraps that input port to the rest of the submodule, and MAP will not proceed under this condition. Although the error was not encountered for the submodules presented later in this work, automated submodule wrapping for hundreds of different modules has uncovered it.

To handle this error in general, every input port HM barrier must be connected not only to the submodule, which may not use the input port, but also to some resource that is guaranteed to be present. To guarantee that a particular resource is present, another HM can be used. This HM is referred to as an output-collector HM because it connects to all the outputs of the submodule input HM barriers, as shown in Figure 3.3. For purposes related to generating hard macros in HMFlow, this output-collector may optionally combine all of its inputs down to a single bit output.

[Figure 3.3 diagram: the wrapper of Figure 3.1 with an added HM output-collector connected to the outputs of the input port HM barriers.]

Figure 3.3: Modified submodule wrapper to include a HM output-collector.

For Virtex 5 and newer devices, once MAP has completed technology-mapping and logic trimming, MAP performs timing-driven placement. The timing-driven placement attempts to find a placement that will meet all of the timing constraints when routed. While the placer needs to place both the submodule and the barrier components, the impact of the HMs on the submodule's implementation is negligible for two reasons. First, the barriers are initially completely unplaced; the HMs do not even contain relative placement constraints. Second, the nets to or from the barriers are not affected by any timing constraints.

When placement begins, because the barriers have no initial placement, the placer can treat all barrier and submodule components equally. Because there are no timing constraints affecting the nets to or from the barriers, the timing-driven placer has little preference where the barriers are placed. The placer is free to prioritize meeting timing constraints on the submodule, and the barriers can end up anywhere not occupied by the submodule components. If the submodule happens to be constrained to reside in a particular area, the placer almost invariably places the barriers around the perimeter of the submodule.

PAR

After placement, PAR routes the wrapped submodule design. Because the barrier HMs have no routing information, all routing resources are initially available as with any other conventional design. Because there are no timing constraints on the nets to or from the barriers, any route is satisfactory for those nets. The router can all but ignore those nets and prioritize routing the intra-submodule nets. The completed design has routes which correspond to any of the solid line nets in Figures 3.1 and 3.3, depending on whether or not an output-collector is required for the submodule.

Barrier Impact on Implementation Negligible

As explained in the MAP and PAR subsections, the presence of the barriers wrapping the submodule has a negligible impact on the quality of the placement and routing of the final submodule implementation. At the very least, the HM barrier method provides the closest known approach to truly independent submodule implementation through the Xilinx EDA flow. Therefore, this method is utilized in later chapters to determine how area constraints impact independent submodule implementations.
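The non-default tool options named in this chapter can be collected in one place for scripted runs. The sketch below is only a bookkeeping aid under stated assumptions: the function name and the dictionary layout are hypothetical, the entries are option lists rather than complete command lines, and all file arguments and exact invocation syntax are deliberately omitted and should be taken from the Xilinx tool documentation. The -t seed option for MAP is the one used for the placement seeds in Chapter 4.

```python
def wrapped_submodule_options(placement_seed):
    """Return the non-default options per tool, as named in the text.

    These are option lists only, not complete command lines; input and
    output file arguments are intentionally left out.
    """
    return {
        # XST: do not insert IOBs into the wrapped submodule.
        "xst": ["-iobuf", "NO"],
        # NGDBuild: no non-default options are named in the text.
        "ngdbuild": [],
        # MAP: -u keeps unused logic, -t selects the placement seed.
        "map": ["-u", "-t", str(placement_seed)],
        # PAR: options are left at their defaults.
        "par": [],
    }


options_for_seed_7 = wrapped_submodule_options(7)
print(options_for_seed_7["map"])
```

Keeping the per-tool options in a single structure makes it straightforward to vary only the placement seed across otherwise identical runs.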

CHAPTER 4. AREA CONSTRAINT IMPACT ON SUBMODULE TIMING

Using the methodology described in Chapter 3 for independent submodule implementation, this chapter focuses on how area constraints impact submodule minimum clock period. In particular, this chapter provides data that helps to answer the following questions, which are of keen interest to any designer creating FPGA floor-plans for Xilinx devices.

- What aspect ratios are best for submodules that comprise the floor-plan?
- How much area should be allocated for a submodule?
- What impact do area constraints have on the minimum clock period for a submodule?
- What guidelines should be observed when assigning submodules to physical locations on the FPGA?

The answers to these questions are found through approximately one million submodule implementations across several carefully chosen submodules. Area constraints of varying aspect ratio and total area are applied to the submodules, and the resulting implementation minimum clock periods are compared.

4.1 Submodule Experiments Setup

To determine how area constraints impact submodule timing, an experiment needs to be fairly large in scale. The randomization involved in placement leads to varied timing results. A single implementation provides only a single point of data in what can be a relatively large range of implementation results. However, large numbers of implementations can expose generalities. To run an adequate number of submodule implementations, this work was supported by the Brigham Young University Fulton Supercomputing Lab, requiring nearly one million processor hours of Xilinx implementation runtime.

4.1.1 Representative Submodules

The submodules selected for testing are listed in Table 4.1 along with their XST resource estimates for a Virtex 5 device. They represent a sampling of a few dataflow submodules as well as some control-heavy submodules. The Microblaze (MB) and Picoblaze (PB) microcontrollers are available from Xilinx, as is the CoreGen 18-bit, LUT-based, pipelined multiplier (Mult). The 128-tap 18-bit finite impulse response (FIR) filter and the double-precision (64-bit) floating point quadratic equation solver (FP) are generated from C++ code by Xilinx AutoESL. The 1024-point 18-bit fast Fourier transform (FFT) originates from a design in System Generator.

FPGA Device Selection

Two different devices are selected from the Xilinx Virtex 5 family as targets for implementation. The first device is the xc5vsx240tff device, which is the largest SX series device. The SX series has the greatest ratios of DSP and BRAM columns in relation to CLB columns. This gives the SX series architectures a more regular structure, with DSP and BRAM columns spread throughout the fabric. This feature, coupled with the large device size, allows more varied area constraints to be explored across all of the submodules.

The second device is the xc5vlx330tff device, which is the largest LX series device. The LX series, in contrast to the SX, has fewer DSP and BRAM columns, and its only two DSP columns are just a few columns apart on one half of the device. The LX series is therefore not optimal for exploring varied area constraints. However, it is instructive to notice which area constraints are preferable in a device with a less regular distribution of resources.

Table 4.1: The sample of submodules and their XST-estimated resource utilizations when targeting a Virtex 5 device. Columns: Submodule, Logic LUTs, Memory LUTs, Registers, BRAMs, DSPs. Rows: FFT, FIR, FP, MB, Mult, PB.

4.1.3 Establishing Baseline Submodule Clock Constraints

Before applying any area constraints, a baseline clock constraint is selected for each submodule that is not simple to achieve, yet not overly difficult. Each submodule is placed and routed over a range of clock constraints in 100 ps steps without any area constraints. An example is shown in Figure 4.1 for the FIR submodule on the SX part. Each clock constraint is implemented for each of 100 MAP placement seeds, specified with the MAP -t option.

At the high end of the clock period constraint range, all implementations meet their constraint. At the low end, none of the implementations meet their constraint. The first constraint that is realizable for the FIR submodule on the SX part is 3.3 ns, but only 2% of implementations achieve it. This constraint may be unrealizable as soon as the FIR submodule is placed in a full design (see footnote 1). The next constraint, 3.4 ns, is realized by 3% of the implementations and is selected as the clock constraint which is applied in conjunction with area constraints.

The baseline clock constraints for each of the submodules on both the SX and LX parts are selected in the same manner as the FIR baseline clock constraint. They are listed in Table 4.2 along with the percentage of implementations which meet that clock constraint without area constraints.

Footnote 1: If the probability model of a design implementation meeting timing is simplified to the product of the probabilities of each independent submodule implementation meeting timing, then a design with two submodules, each with only a 2% probability of meeting timing, has an effectively unrealizable 0.04% probability of meeting timing.

Table 4.2: The selected baseline clock period timing constraints for each submodule on both the SX and LX series parts, with the percentage of implementations which met timing without any area constraints (implemented over 100 MAP seeds).

Submodule   SX Clock   Met Timing   LX Clock   Met Timing
FFT         2.8 ns     9 %          2.8 ns     11 %
FIR         3.4 ns     3 %          4.0 ns     20 %
FP          5.0 ns     23 %         5.0 ns     12 %
MB          5.0 ns     27 %         5.0 ns     38 %
Mult        2.7 ns     41 %         2.7 ns     27 %
PB          2.8 ns     27 %         2.8 ns     20 %
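The simplified probability model in the footnote can be checked with a few lines of arithmetic. The sketch below assumes, as the footnote does, that submodule implementations meet timing independently, so the design-level probability is the product of the per-submodule probabilities. The function name is hypothetical.

```python
import math


def joint_meet_probability(submodule_probabilities):
    """Probability that every submodule meets timing, assuming independence."""
    product = 1.0
    for probability in submodule_probabilities:
        if not 0.0 <= probability <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        product *= probability
    return product


# Two submodules that each meet timing in 2% of implementations.
two_at_two_percent = joint_meet_probability([0.02, 0.02])
print("{:.2%}".format(two_at_two_percent))  # prints 0.04%
```

Under this model, adding submodules with low individual success rates drives the design-level probability toward zero very quickly, which is why the 2% constraint is rejected as a baseline.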

[Figure 4.1 plot: clock periods achieved over 100 implementations (vertical axis) versus clock period constraint (horizontal axis).]

Figure 4.1: FIR submodule implemented without area constraints, but with varying clock period constraints. At each constraint, the clock period achieved is plotted for 100 placement seeds. The percentage of implementations that met the constraint is also indicated along the horizontal axis. The shaded region overlaps all implementations that met their constraints.

Variation of Area Constraints

Only rectangular area constraints are considered in these experiments. Irregular area constraints are avoided not only to simplify the exploration, but also because area constraints do not constrain the routing, so routing will naturally overlay a rectangular area even if the placement area is an L or T shape. The rectangular area constraints are varied in two dimensions and are based on the tile grid.

The first dimension is the aspect ratio of the rectangle defined by Equation 4.1:

\[ \text{Aspect Ratio} = \frac{\text{Area Constraint Tile Width}}{\text{Area Constraint Tile Height}} \tag{4.1} \]

For the implementations of the submodules, the aspect ratio is allowed to vary over all unique ratios from combinations of integers between 1 and 5 (1:1, 1:2, 1:3, 1:4, 1:5, 2:1, ..., 5:4).

The second dimension is the resource area overhead percentage defined by Equation 4.2:

\[ \text{Area Overhead \%} = \left( \frac{\text{Resources Within Area}}{\text{Estimated Resources Required}} - 1 \right) \times 100\% \tag{4.2} \]

By this definition, an area overhead of 0% implies that the area constraint encloses just the required number of resources estimated by synthesis, while an area overhead of 50% implies that the area constraint encloses 50% more resources than the required number. For the implementations of the submodules, the overhead is allowed to vary between 0% and 150% in 10% increments.

Automatic Generation of Area Constraints

An area constraint generator Java package was written to utilize RapidSmith and automatically generate area constraints with many combinations of aspect ratio and area overhead for each of the submodules. As shown in Figure 4.2, the area constraint generator annotates an input UCF file with submodule area constraints based on the other inputs. The other inputs include the XST synthesis report file (.SRP), the target Xilinx part, the area constraint aspect ratio and overhead, and, optionally, a seed tile. The process selects a rectangular area for the submodule centered on the seed tile with the requested aspect ratio and overhead, based on the estimated resource utilization reported in the SRP. If a seed tile is not specified, an appropriate one is chosen for the submodule based on the types of resources utilized.

In all of the following experiments, a seed tile is specified for each of the submodules. The tile is chosen to be near DSPs or BRAMs if they are required by the submodule, or to be in a CLB-rich region if they are not.
It is chosen to be nearly centered on one side of the device's central clock column. This location is chosen to maximize the allowable horizontal and vertical growth of the various area constraints considered while avoiding, as much as possible, horizontal growth that crosses the central clock column. It has been noticed that wire segments that cross the central clock column can have greater delays than corresponding wire segments elsewhere on the device [1].

[Figure 4.2 diagram: the .SRP file, Xilinx part, aspect ratio, area overhead, optional seed tile, and .UCF file feed the area constraint generator, which outputs an annotated .UCF file.]

Figure 4.2: The area constraint generator takes as input a UCF file name which it annotates with submodule area constraints based on the other inputs.

Because the area constraints are generated to be rectangular, it is important to note that the target area overhead may be overshot by a slight amount. When extending the rectangle in either height or width, the only option is to add an entire row or column of tiles. Extra tiles may be added in this manner.

It is also important to note that some submodules may have disproportionate resource type requirements. The rectangular area constraint grows in size until all resources have had their overhead satisfied. This can lead to the submodule area constraint having a greater actual overhead on the less utilized resource types. Figure 4.3 provides an example in which the DSP resource requirement of the submodule is disproportionately large compared to the BRAM and CLB requirements. While the BRAM and CLB resources could fit into the marked regions, such an area constraint can cause nets that need to be routed long distances between the different resource types. The composite area constraint allows more BRAM and CLB resources than specified by the area overhead so that BRAM and CLB resources can freely fill the DSP area and reduce routing distances.
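The effect of disproportionate resource requirements on the actual overhead can be illustrated numerically. The sketch below applies the area overhead definition of Equation 4.2 (0% means the constraint encloses exactly the estimated requirement) to each resource type separately. The function names and all resource counts are hypothetical and are not taken from any submodule in this work.

```python
def area_overhead_percent(resources_within, resources_required):
    """Area overhead as defined in the text: 0% means exactly enough resources."""
    return (resources_within / resources_required - 1.0) * 100.0


def per_resource_overhead(within, required):
    """Overhead of each resource type enclosed by one rectangular constraint."""
    return {
        name: area_overhead_percent(within[name], required[name])
        for name in required
    }


# Hypothetical DSP-heavy submodule: the rectangle must grow until the DSP
# overhead target is reached, which encloses far more SLICEs and BRAMs
# than the same target would require on their own.
required = {"SLICE": 100, "BRAM": 4, "DSP": 40}
within = {"SLICE": 400, "BRAM": 16, "DSP": 50}
overheads = per_resource_overhead(within, required)
print(overheads)
```

In this made-up case the DSP overhead is 25% while the SLICE and BRAM overheads are 300%, which mirrors the situation described for DSP-dominated submodules: the dominant resource sets the rectangle size and the others inherit a much larger actual overhead.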

[Figure 4.3 diagram: regions marked BRAM Requirements, CLB Requirements, and DSP Requirements inside the Composite Area Constraint.]

Figure 4.3: Example of a submodule with a disproportionately large DSP resource requirement. While the BRAM and CLB resources could fit into the marked regions, the composite area constraint will allow more BRAM and CLB resources than specified by the area overhead.

Submodule Implementation with Various Area Constraints

Using the automatic constraint generator and a few thousand computers available at the Brigham Young University Fulton Supercomputing Lab, each submodule is implemented for each of 100 MAP placement seeds for each combination of aspect ratio and overhead, requiring 30,400 iterations of Xilinx v13.1 NGDBuild, MAP, and PAR for each submodule. The PAR options are left at their defaults, while the MAP options include -t to vary the placement seed and -u to prevent circuit annihilation. The resulting minimum clock period of each implementation is obtained using Xilinx TRCE, and the results are presented in the next section.

4.2 Area Constraint Impact on Submodule Minimum Clock Period

The results of the area constraint experiments described in the previous section for both the SX and LX series parts are shown in Figures 4.4 and 4.5, respectively. Each graph displays the percentage of implementations of a specific submodule which met its clock period constraint at each combination of aspect ratio and area overhead.
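The 30,400 iterations per submodule can be reproduced by enumerating the experiment grid. The sketch below rests on two assumptions that should be read as such: aspect ratios are the unique width-to-height ratios of integers from 1 to 5, and area overhead runs from 0% to 150% in 10% steps. These bounds are chosen because they are consistent with the stated total; the function names are hypothetical.

```python
from fractions import Fraction


def unique_aspect_ratios(max_integer=5):
    """Unique width/height ratios from integer pairs in 1..max_integer (assumed 5)."""
    return sorted(
        {
            Fraction(width, height)
            for width in range(1, max_integer + 1)
            for height in range(1, max_integer + 1)
        }
    )


def overhead_steps(start=0, stop=150, step=10):
    """Area overhead percentages, inclusive of both ends (assumed 0% to 150%)."""
    return list(range(start, stop + 1, step))


ratios = unique_aspect_ratios()
overheads = overhead_steps()
seeds = 100  # MAP placement seeds per combination, as stated in the text
runs = len(ratios) * len(overheads) * seeds
print(len(ratios), len(overheads), runs)  # prints 19 16 30400
```

Using exact fractions removes duplicates such as 2:2 and 4:4, which collapse to 1:1, leaving 19 distinct aspect ratios. With 16 overhead steps and 100 seeds, the grid contains 30,400 implementations per submodule.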

[Figure 4.4 plots: six panels showing the percentage of implementations meeting timing for FFT (2.8 ns), FIR (3.4 ns), FP (5.0 ns), MB (5.0 ns), Mult (2.7 ns), and PB (2.8 ns); vertical axis: Tile Area Aspect Ratio (W/H); horizontal axis: Resource Area Overhead (%).]

Figure 4.4: SX Series. Each graph shows the percentages of implementations of a specific submodule meeting its specified clock constraint when rectangular area constraints are applied. The rectangle aspect ratio (width/height) varies along the vertical axes, while the area overhead varies along the horizontal axes. The percentage of implementations meeting timing is mapped to the gradient bar at the right of each graph.

[Figure 4.5 plots: six panels showing the percentage of implementations meeting timing for FFT (2.8 ns), FIR (4.0 ns), FP (5.0 ns), MB (5.0 ns), Mult (2.7 ns), and PB (2.8 ns); vertical axis: Tile Area Aspect Ratio (W/H); horizontal axis: Resource Area Overhead (%).]

Figure 4.5: LX Series. Each graph shows the percentages of implementations of a specific submodule meeting its specified clock constraint when rectangular area constraints are applied. The rectangle aspect ratio (width/height) varies along the vertical axes, while the area overhead varies along the horizontal axes. The percentage of implementations meeting timing is mapped to the gradient bar at the right of each graph.

4.2.1 SX Series Results

The SX series provides the xc5vsx240t as the ideal device for exploring the most varied area constraints since DSP and BRAM resources are evenly distributed throughout the fabric of the FPGA. A somewhat unexpected generality found in every submodule other than the FIR filter is that any combination of aspect ratio and area overhead can meet a reasonably difficult clock period. This is seen in the non-zero percentages at every point in the graphs of Figure 4.4 for every submodule except the FIR submodule. This is not necessarily intuitive because it is generally understood that deviating from a square aspect ratio can increase total wire length and that restricting area can adversely affect the quality of the placement or subsequent routability. This may be partially explained by the fact that the submodules all have at least about a 1:1 LUT to register ratio, so there should be a small percentage of nets which could become the critical path.

Area Overhead Effects

Although generally any area overhead can meet the target clock constraint, very low area overheads of 0-20% tend to have the fewest implementations that meet the clock constraints, often fewer than when the submodule is implemented without an area constraint. An exception to this occurs when a near ideal aspect ratio is used in conjunction with low overhead, as seen with the Mult submodule at an aspect ratio of 0- and overhead of 0%. The Mult submodule has a large number of vertical shift and carry chains and a small overall area, so it prefers area constraints with lower (vertically oriented) aspect ratios.

As the area overhead increases, there is more room for the placer to skew a submodule away from the area constraint's aspect ratio and toward an aspect ratio more preferable for the submodule, whatever that may be. Therefore, higher area overhead can lead to a greater percentage of the implementations meeting the timing constraints.
Generally, above 20% overhead, the percentages of implementations meeting the timing constraints are similar to when the submodule is implemented without area constraints. The Microblaze and Picoblaze, however, display positive exceptions to the trend with dramatic improvements when implemented with area constraints. The Microblaze in particular only met its timing 34

43 constraint in 27% of implementations without area constraints, yet most of the results with area constraints meet the timing constraints in over 80% of implementations. Aspect Ratio Effects Aspect ratio generally appears to have little effect on any of the results once area overhead exceeds about 20%. At lower area overheads, the effects of aspect ratio are most pronounced at the extremes. In particular, the Microblaze and Picoblaze implementations suffer at low (vertically oriented) aspect ratios and low overhead. This is in contrast to the Mult submodule which shows a strong preference for low aspect ratios. Meanwhile, the FP submodule has greater difficulty at both the high and low aspect ratio extremes with low overhead. Breaking the generality, the FIR submodule implementations simply will not meet its target clock constraint except at moderate aspect ratios, regardless of the area overhead. The FIR filter demonstrates an exception to the generalities noticed among the other submodules. This may be due to its composition. The FIR submodule differs from the other submodules by utilizing a disproportionately large number of DSPs in relation to the number of other resources. The overall tile area selected is dominated by the high DSP requirement. This condition makes it difficult for the relatively few registers to adequately fill in the total area, requiring longer paths between registers. Path delays are found to be minimized when the tile area is minimized and the tile aspect ratio is near 2.0. The actual underlying physical geometries of the tiles are unknown, but it is noticed that wire delays are about equal for an equivalent number of interconnect tile hops in either the horizontal or vertical direction. Consider a wire beginning at the lower left interconnect tile in Fig The nearest interconnect tile to the north is one tile away, while the nearest interconnect tile to the east is three tiles away. 
The wire going to either has essentially the same delay. Over a large range of tiles on any Xilinx device, the average horizontal spacing of interconnect tiles is about double the average vertical spacing. This leads to a preferred tile aspect ratio of 2.0 to minimize path delays. Therefore, at low area overhead, the FIR implementations which meet the clock constraint form a band centered on an aspect ratio of 2.0.

The lower bound on the aspect ratio of the FIR submodule decreases as the area overhead increases. This is easily explained: as the area overhead increases, progressively lower (vertically oriented) aspect ratio constraints can still fit a higher (horizontally oriented) aspect ratio placement within the area boundaries. The decrease in the aspect ratio upper bound as the overhead increases is not as intuitive. The reason for this trend has to do with crossing the central clock column of the device.

Central Clock Column Effects

As seen in Figure 2.1, the central column of Xilinx devices contains specialized tiles for clocks, I/O, etc. Wire segments which cross the central column can have greater delay than corresponding wire segments elsewhere on the device. The exact differences in delay depend on segment length and load, but can be on the order of hundreds of picoseconds [1]. Slightly increased delays can also be found on wires crossing horizontal clock region boundaries, but these delays are only on the order of tens of picoseconds. The upper bound on the aspect ratio for the FIR submodule corresponds precisely with the area constraint crossing over the central column, causing routes that cross the central column to have increased delay. However, it should be noted that the central column effect may be overly pronounced in the FIR submodule because of the impact of the disproportionate number of resources. The FP submodule spans the central column at higher (horizontally oriented) aspect ratios as well, yet it has many implementations meeting the clock constraint in those cases. So, while it does not disrupt all implementations, an area constraint which includes the central column may have a negative effect on timing.

4.2.2 LX Series Results

While the SX series sheds light upon area constraint impact on timing in general, the LX series devices do not have such an even distribution of resources on the FPGA fabric. Therefore, the results differ somewhat for submodules targeting the xc5vlx330t device.

Area Overhead Effects

In general, the area overhead impact on timing in the LX series part is very similar to that of the SX series. Again, low area overheads of 20% or less are correlated with lower percentages of implementations that meet timing.

Again in most cases, the Microblaze and Picoblaze microcontrollers tend to have a significantly greater number of implementations meeting timing with area constraints than without. Without area constraints, the Microblaze hit 5.0 ns in 38% of the implementations. For the majority of the area constraints applied to the Microblaze, it hit 5.0 ns in over 80% of the implementations.

The LX series FIR submodule graph is markedly different from the SX series. First of all, the LX series device did not have enough DSPs to support area overheads greater than 90%, so the graph contains no data at overheads greater than 90%. Secondly, the target minimum clock period is 4.0 ns for the LX part versus 3.4 ns for the SX part. No implementations without area constraints could manage to hit anything lower than 4.0 ns on the LX part. This is due to the disproportionately large number of DSPs required by the FIR submodule and the arrangement of DSPs on the LX device. The FIR submodule requires over 50% of the DSPs available on the device, and the DSPs are only available in two columns. This causes the submodule to be stretched vertically to reach enough DSPs, and results in higher minimum clock periods.

Lastly, the LX FIR graph data is greatly skewed because of the high submodule DSP requirement and vertical arrangement of DSPs on the LX device. In order to satisfy the submodule DSP requirements, every area constraint needed to be at least half the height of the device to meet the submodule's requirement of a little over half the device's DSPs. Because the area constraint generator maintains the target aspect ratio while satisfying area overhead, the resulting area constraints can have much greater area overhead than the target input area overhead.

For example, a horizontally-oriented aspect ratio of 5.0 with a target area overhead of 0% grows at that aspect ratio until its height is just over half the device's height before it finally satisfies the FIR's DSP requirement with 0% DSP area overhead. At this point, to maintain its aspect ratio, its width should have grown to about 2.5x the width of the device, but it is cut off at the device width. The corresponding BRAM and CLB actual area overhead percentages now reach into the hundreds. This behavior did not occur on the SX part because additional DSPs are available by horizontal or vertical area constraint expansion in the SX series, while DSPs are effectively only available through vertical area constraint expansion in the LX series.
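The growth behavior described above can be illustrated with a small model. This is only a sketch under stated assumptions: the function names, the 100 x 100 tile device, the two DSP column positions, and the one-DSP-per-tile-row density are all hypothetical and do not reflect the real LX device geometry or the thesis's actual area constraint generator.

```python
# Hypothetical model of an area constraint growing at a fixed aspect ratio
# until it contains enough DSPs. All geometry below is made up for illustration.

def grow_constraint(aspect_ratio, dsp_needed, device_w, device_h, dsp_columns):
    """Grow a rectangle anchored at the origin, keeping width/height equal to
    aspect_ratio, until it covers dsp_needed DSPs.

    Assumes one DSP per tile row in each DSP column (an assumption).
    Returns (clipped_width, height, ideal_width) or None if it never fits.
    """
    for height in range(1, device_h + 1):
        ideal_width = aspect_ratio * height
        # The constraint cannot be wider than the device itself.
        width = min(device_w, max(1, round(ideal_width)))
        columns_inside = sum(1 for x in dsp_columns if x < width)
        if columns_inside * height >= dsp_needed:
            return width, height, ideal_width
    return None


def actual_overhead(width, height, tiles_needed):
    """Area overhead actually granted once the rectangle has grown."""
    return width * height / tiles_needed - 1.0


# A submodule needing a little over half of the 200 DSPs on the toy device.
width, height, ideal_width = grow_constraint(5.0, 104, 100, 100, [10, 20])
print(width, height, ideal_width / 100)
# Logic overhead if the submodule only needs 1300 tiles of ordinary logic.
print(f"{actual_overhead(width, height, 1300):.0%}")
```

In this toy model the rectangle stops growing at just over half the device height, its unclipped width is several times the device width, and the overhead for the non-DSP resources lands in the hundreds of percent, which mirrors the effect described for the LX FIR graph.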

Aspect Ratio Effects

Implementation minimum clock periods on the LX device are more sensitive to aspect ratio than they are on the SX device, but only in submodules that have DSP requirements. This is again related to the vertical DSP resource arrangement. The FFT submodule requires enough DSPs that it has a slight preference for a lower (vertically oriented) aspect ratio. The Microblaze, however, only requires 3 DSPs, and then needs many SLICEs. Low (vertically oriented) aspect ratios provide a lot of DSPs which really only serve to make paths between SLICEs longer. Therefore, the Microblaze has more difficulty at lower aspect ratios.

The problematic FIR submodule on the LX device still prefers a moderate aspect ratio of about 2.0, but it has lost the distinctive graph banding that it had on the SX device. The banding is not entirely missing, however. The low (vertically oriented) aspect ratio portion of the LX FIR graph has low percentages of implementations that met timing, and even some holes where no implementations met timing. That region corresponds with the same empty region in the SX FIR graph, for the same reason it was empty on the SX device: the aspect ratio is too extreme for the disproportionately high DSP utilization.

The higher (horizontally oriented) aspect ratios on the LX device are different, however. The high FIR submodule DSP requirements and the few, vertically arranged DSPs on the LX device place a lower bound on the height of the area constraint at roughly half the device height. Therefore, beyond a certain point, as the aspect ratio increases, since there is a minimum area constraint height, the area constraint increases in width to meet the aspect ratio. At a fixed area overhead, once the aspect ratio is high enough that the area constraint height is minimized, any further increase in aspect ratio is effectively only additional area overhead for SLICEs and BRAMs. This has little effect on the FIR implementation, so it shows up in the graph as the moderate aspect ratio data streaking upward into higher aspect ratios. The high aspect ratio data in the LX FIR graph is therefore not very helpful and can be profitably ignored.

Central Clock Column Effects

The central clock column did not have any noticeable effect on the LX series results. The only noticeable effect in the SX series was on the FIR filter. In the SX series, there are DSPs available on both sides of the central clock column, while they are only available on one side in the LX series. Therefore, implementations in the LX series often did not cross the central clock column even when it was available in area constraints with large enough area overhead.

4.3 Conclusions

The results from both the SX and LX series experiments profile how area constraints affect timing in submodules independent of a full design. They help answer the following questions:

What aspect ratios are best for submodules that comprise the floor-plan? If resource types are evenly distributed throughout the FPGA fabric, submodules are generally aspect-ratio agnostic. The exceptional case, however, indicates that the tile aspect ratio should be selected to be near 2.0 if possible. If a resource type is unevenly distributed throughout the FPGA fabric and a submodule uses that particular resource type, then that submodule can have a slight preference in aspect ratio depending on its composition. If it uses many of the unevenly distributed resource type, then it prefers an aspect ratio that aligns with that resource's distribution in the FPGA fabric. If it uses few of that resource type, then it prefers that its aspect ratio does not align with that resource's distribution in the FPGA fabric.

How much area should be allocated for a submodule? The minimum area overhead of 0% can provide implementations that meet the clock constraint; however, to meet timing most often, an area overhead of at least 20% is preferred.

What impact do area constraints have on the minimum clock period for a submodule? In general, area constraints do not prohibit meeting a submodule's minimum clock period. Variations of aspect ratio and area overhead can impact how often the minimum clock period is achieved, but it is achievable with an area constraint. With an area overhead of 20% or more, the percentage of area-constrained submodule implementations that meet timing is about equal to the percentage of submodule implementations that meet timing without any area constraints.
For some submodules, introducing area constraints can substantially improve how often the minimum clock period is met. In the set of submodules considered in this work, this occurred in the control-heavy Microblaze and Picoblaze microcontrollers. The other submodules were pipelined dataflow operators. This suggests that introducing area constraints helps meet timing in control-heavy submodules, though a larger set of submodules would have to be studied to strengthen or dismiss the correlation.

What guidelines should be observed when assigning submodules to physical locations on the FPGA? If possible, submodules should not be assigned to areas that cross the central clock column of the device, since wires that cross the central clock column can have greater delays than corresponding wires elsewhere on the device. Within that restriction, a submodule should simply be placed in an area of the FPGA fabric that reflects its resource requirements.
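As a rough illustration, the guidelines above could be encoded as a simple check on a candidate area constraint. This is a hedged sketch and not a tool from this work: the `AreaConstraint` class, the tile-coordinate model, and the "moderate" aspect ratio band of 1.0 to 4.0 are assumptions chosen for the example; only the 20% overhead threshold and the central clock column rule come from the text.

```python
from dataclasses import dataclass


@dataclass
class AreaConstraint:
    """Hypothetical rectangular area constraint measured in tiles."""
    x_min: int           # leftmost tile column covered
    width: int           # width in tiles
    height: int          # height in tiles
    required_tiles: int  # tiles the submodule needs with zero overhead

    @property
    def overhead(self):
        return self.width * self.height / self.required_tiles - 1.0

    @property
    def aspect_ratio(self):
        return self.width / self.height


def guideline_warnings(constraint, central_column_x, moderate_band=(1.0, 4.0)):
    """Return a list of guideline violations for one area constraint.

    moderate_band is an assumed range around the preferred ratio of 2.0.
    """
    warnings = []
    if constraint.overhead < 0.20:
        warnings.append("area overhead below 20%")
        low, high = moderate_band
        # Aspect ratio mattered mostly at low overhead in the results.
        if not low <= constraint.aspect_ratio <= high:
            warnings.append("extreme aspect ratio at low overhead")
    if constraint.x_min <= central_column_x < constraint.x_min + constraint.width:
        warnings.append("crosses central clock column")
    return warnings


good = AreaConstraint(x_min=0, width=20, height=10, required_tiles=160)
risky = AreaConstraint(x_min=40, width=30, height=5, required_tiles=140)
print(guideline_warnings(good, central_column_x=50))
print(guideline_warnings(risky, central_column_x=50))
```

The check returns warnings rather than rejecting a constraint outright, because the results show that even extreme constraints can meet timing, only less often.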

CHAPTER 5. FLOOR-PLANNING IMPACT ON DESIGN TIMING

While Chapter 4 examines how area constraints impact submodule timing, this chapter investigates how floor-planning affects design timing. Through several experiments, this chapter determines if the conclusions drawn in Chapter 4 for independent submodule implementations are still valid when the submodules are placed in full design contexts. In particular, several questions are answered.

Can a floor-planned design meet timing with any combination of submodule aspect ratio and area overhead?
Is a moderate aspect ratio preferred for submodules that comprise the floor-plan?
Is an area overhead greater than 20% preferred for submodules that comprise the floor-plan?

5.1 Test Designs

Two different test designs are considered in the experiments. Both designs target the xc5vsx240t device. An LX series device was not considered in the experiments because it does not provide enough DSPs for either of the designs. The first design, FP-Pipe, is a synthetic design which combines six of the FP submodules in a pipeline to provide high device utilization while allowing for extremely regular floor-plan layouts. The second design, MIMO, is a realistic multiple-input multiple-output radio design composed of transforms, filters, and decoders. The MIMO design is less ideal for testing varied submodule area constraints, but presents realistic floor-planning constraints that a designer may encounter. The device usage estimates for each of the designs are shown in Table 5.1, and the actual device utilizations from a single run of each design are shown in Table 5.2.

Table 5.1: Lists the XST synthesis xc5vsx240t device utilization estimates of each design.
Design LUTs % Registers % BRAMs % DSPs %
FP-Pipe 82,320 4% 129,04 86% 72 13% %
MIMO 41,724 28% 89,60 60% % %

Table 5.2: Lists the actual xc5vsx240t device utilization for a single run of each design.
Design LUTs % Registers % BRAMs % DSPs %
FP-Pipe 81,88 4% 128,976 86% 72 13% %
MIMO 38,40 2% 83,636 % % %

5.2 FP-Pipe Design

The FP submodule from Chapter 4 is able to meet a 5.0 ns minimum clock period. However, when six of the FP submodules are connected in series into a single design, the design is not able to meet a 5.0 ns minimum clock period for any of 100 placer seeds. This result is not altogether surprising. The FP submodule without area constraints only meets a 5.0 ns minimum clock period in 23% of implementations. Using a simplified probability model where the implementation of each of the FP submodules is independent of the others in the FP-Pipe design, one could expect the FP-Pipe design to meet a 5.0 ns minimum clock period with a probability of (0.23)^6 ≈ 0.0148%. Therefore, it is highly unlikely that an FP-Pipe design implementation will meet a 5.0 ns minimum clock period. The FP-Pipe design cannot meet a 5.1 ns minimum clock period either, but it meets a 5.2 ns minimum clock period constraint in 78% of implementations without floor-planning.

Trace reports indicate that the critical paths in the FP-Pipe design are intra-submodule paths rather than the new inter-submodule paths introduced in the design. This shows that a submodule that is placed in a full design context may no longer be able to meet as tight a timing constraint as it can independent of a design context. This concept was brought up in Chapter 4 and is the reason why the tightest timing constraints were avoided when establishing baseline submodule clock constraints.
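The simplified independence model above is easy to check numerically. The sketch below is illustrative only; the function names are hypothetical, and the second function (the chance of at least one success over 100 seeds) is an added assumption that seeds are independent trials, which the text does not claim.

```python
def all_meet_probability(p_single, n_submodules):
    """Probability that every submodule meets timing, assuming independence."""
    return p_single ** n_submodules


def at_least_one_success(p_design, n_seeds):
    """Probability that at least one of n_seeds independent runs meets timing."""
    return 1.0 - (1.0 - p_design) ** n_seeds


# 23% per-submodule success rate and six pipelined FP submodules, as in the text.
p_design = all_meet_probability(0.23, 6)
p_any = at_least_one_success(p_design, 100)
print(f"per-run probability: {p_design:.4%}")
print(f"chance of any success in 100 seeds: {p_any:.2%}")
```

Under this model even 100 placer seeds leave only a small chance of seeing a single passing run, which is consistent with none of the 100 seeds meeting the constraint.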

5.2.1 Independent FP Submodule Implementations with 5.2 ns Constraint

The percentages of implementations of the independent FP submodule that meet a 5.2 ns minimum clock period at varying aspect ratios and overhead are shown in the upper graph of Figure 5.1. For comparison, 81% of independent FP submodule implementations meet a 5.2 ns minimum clock period without any area constraints. Most combinations of aspect ratio and area overhead meet the clock constraint in 80-90% of implementations. This percentage drops at lower area overheads and more extreme aspect ratios. Similar to the 5.0 ns constraint as seen in Figure 4.4, the independent FP submodule prefers moderate aspect ratios.

When considering the full FP-Pipe design, not all aspect ratio and area overhead combinations can be implemented. With six of the FP submodules on the xc5vsx240t device in the FP-Pipe design, the register utilization reaches 86%. Therefore, if the extra area is distributed evenly among the submodules, each submodule is allowed about 15% area overhead. Because the independent FP submodule results indicate that a higher percentage of implementations meet a 5.2 ns minimum clock period when the area overhead is maximized, all of the submodules in the FP-Pipe design are allowed 15% area overhead.

The lower graph of Figure 5.1 shows an interpolated cross section of the upper graph at 15% area overhead. The interpolation simply averages the 10% and 20% overheads and extends the slopes of the graphs at extreme aspect ratios to include 0.16 and 6.0 aspect ratios. This cross section indicates that the greatest percentage of implementations should meet timing when moderate submodule aspect ratios are selected in the FP-Pipe floor-plan. In particular, the FP submodule with an aspect ratio between and meets the 5.2 ns minimum clock period in over 90% of implementations.
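The arithmetic behind the available area overhead and the interpolated cross section can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the percentages passed in the usage lines are invented sample values rather than thesis data, and deriving overhead as 1/utilization - 1 is one plausible reading of how the evenly distributed extra area was computed.

```python
def overhead_from_utilization(utilization):
    """Area overhead available per submodule if all spare area is shared evenly."""
    return 1.0 / utilization - 1.0


def midpoint_cross_section(column_a, column_b):
    """Average two neighboring overhead columns, one value per aspect ratio."""
    return [(a + b) / 2.0 for a, b in zip(column_a, column_b)]


def extend_linearly(x0, y0, x1, y1, x):
    """Extend the slope through (x0, y0) and (x1, y1) out to a new x."""
    slope = (y1 - y0) / (x1 - x0)
    return y1 + slope * (x - x1)


# 86% register utilization leaves roughly 16% headroom per submodule.
print(f"{overhead_from_utilization(0.86):.1%}")

# Invented sample percentages at 10% and 20% overhead for two aspect ratios.
print(midpoint_cross_section([80.0, 90.0], [90.0, 94.0]))

# Extending an invented slope beyond the lowest sampled aspect ratio.
print(extend_linearly(0.25, 70.0, 0.5, 80.0, 0.0))
```

A plain midpoint is used because the target overhead sits halfway between the two sampled columns; any other target would need a weighted average instead.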
Meanwhile, implementations with extreme aspect ratios meet the clock constraint in less than 70% of implementations.

Figure 5.1: Upper graph shows the percentages of implementations of the independent FP submodule meeting a 5.2 ns clock constraint when rectangular area constraints are applied. The rectangle aspect ratio (width/height) varies along the vertical axis, while the area overhead varies along the horizontal axis. The percentage of implementations meeting timing is mapped to the gradient bar at the right. The lower graph shows an interpolated cross section of the upper at 15% area overhead. Upper graph: % of FP Implementations Meeting 5.2 ns; Tile Area Aspect Ratio (W/H) versus Resource Area Overhead (%). Lower graph: Interpolated FP Cross Section at 15% Area Overhead; % of FP Implementations Meeting 5.2 ns versus Tile Area Aspect Ratio (W/H).

5.2.2 FP-Pipe Design Floor-plans

The FP-Pipe design is floor-planned in the three different configurations shown in Figure 5.2. Xilinx PlanAhead was used to perform the floor-planning. The submodules are constrained to have an area overhead of about 15%, so each floor-plan covers the entire device area. The submodule tile aspect ratios of the three floor-plans are different. The first floor-plan has submodules with an aspect ratio of 0.16 (vertically oriented) and arranges each of the submodules left to right in their pipeline order. The second floor-plan has submodules with an aspect ratio of 1.37 (nearly square) and arranges the submodules counterclockwise in a U in their pipeline order. The third floor-plan has submodules with an aspect ratio of 5.93 (horizontally oriented) and arranges each of the submodules top to bottom in their pipeline order. These three floor-plans are chosen as essentially the only three available where each of the FP submodules in a particular floor-plan has similar area constraints. They provide a moderate aspect ratio test case as well as test cases at extreme aspect ratios.

5.2.3 FP-Pipe Floor-plan Implementation Results

Each of the floor-planned designs, as well as a design without a floor-plan, was implemented for each of 100 placer seeds with a 5.2 ns minimum clock period constraint. The percentages of implementations which met the clock constraint for each design are shown in Table 5.3. The design without a floor-plan met the clock constraint in 78% of implementations. This is exceeded by the moderate aspect ratio floor-plan, which met the clock constraint in 95% of implementations. The extreme aspect ratios did not have such high percentages. The 5.93 (6/1) aspect ratio had a percentage comparable to the design without a floor-plan, but the 0.16 (1/6) aspect ratio only met timing in 32% of implementations. Although the extreme aspect ratios are approximately the reciprocal of one another, the results show that the 5.93 aspect ratio is clearly superior to the 0.16 aspect ratio. This is likely because the 5.93 aspect ratio is nearer to the 2.0 (2/1) aspect ratio in which intra-submodule path delays should be minimized.

The results of the FP-Pipe full design implementations strongly correlate with the results of independent FP submodule implementations of Figure 5.1.
For both the independent FP submodule and the full FP-Pipe design, the moderate aspect ratio met timing in about 95% of implementations, obtaining better results than the design without a floor-plan. Also, for both the independent FP submodule and the full FP-Pipe design, the extreme aspect ratios met timing in a considerably lower percentage of implementations, with the high (horizontally-oriented) aspect ratio outperforming the low (vertically-oriented) aspect ratio. The FP-Pipe design shows that an independent submodule implementation gives a good indication of how the submodule will behave under similar area and timing constraints in a full design.

Table 5.3: Percentage of FP-Pipe implementations meeting a 5.2 ns minimum clock period.
Aspect Ratio % Met
No Floor-plan 78%
0.16 32%
1.37 95%
5.93 75%

Figure 5.2: PlanAhead views of three FP-Pipe floor-plans, each with a different tile aspect ratio (0.16, 1.37, and 5.93).

5.3 MIMO Design

The full MIMO design consists of many discrete-time signal processing submodules. It was developed using the Xilinx System Generator Simulink environment, so it has a structural architecture. The design heavily utilizes DSPs and BRAMs, and it is often the BRAM-based memory structures which contain the design's critical paths. The design is able to meet a 4.8 ns minimum clock period in 12% of implementations without a floor-plan but cannot meet a 4.7 ns clock period.

5.3.1 MIMO Design Floor-plans

Of the submodules presented in Chapter 4, the MIMO design contains the FFT and Picoblaze submodules. These submodules, however, are able to hit minimum clock periods much lower than 4.8 ns, so they do not contribute to the design's critical path. A simple experiment which only floor-planned the FFT submodule in the MIMO design showed that floor-planning the FFT had a negligible impact on the resulting design clock period. Figure 5.3 shows the percentages of implementations of the MIMO design meeting a 4.8 ns clock constraint when only the FFT submodule is floor-planned. The nearly flat result space indicates that regardless of the area constraint applied to the FFT submodule, the design still met the 4.8 ns clock constraint about as often as it did without any floor-planning. Even with area constraints, the FFT did not become the critical path because it has so much slack with a 4.8 ns constraint.
Rather than unprofitably floor-plan the FFT or Picoblaze submodules, four large submodules were selected from the MIMO design which contained potential critical paths. In the order they appear in the pipeline, these four submodules are multiband correlator, frequency estimator, metric computer, and trellis decoder. The resource utilization estimates for each submodule are listed in Table 5.4.

Figure 5.3: Shows the percentages of implementations of the MIMO design meeting a 4.8 ns clock constraint when rectangular area constraints are applied to the FFT submodule. The rectangle aspect ratio (width/height) varies along the vertical axis, while the area overhead varies along the horizontal axis. The percentage of implementations meeting timing is mapped to the gradient bar at the right. Graph: % of MIMO Implementations Meeting 4.8 ns; Tile Area Aspect Ratio (W/H) versus Resource Area Overhead (%).

Table 5.4: Lists the XST synthesis xc5vsx240t device utilization estimates for each of the MIMO submodules selected for floor-planning.
Submodule LUTs Registers BRAMs DSPs
Multiband Correlator 14,89 30,
Frequency Estimator 7,173 14,
Metric Computer 4,088 6,
Trellis Decoder 9,23 20,

Unlike the FP-Pipe design, only portions of the MIMO design are floor-planned. This allows for greater freedom in choosing submodule area overhead than was available in the FP-Pipe design. While there is greater freedom in choosing submodule area overhead, when the area overhead grows, the aspect ratio range available for larger submodules becomes restricted by the physical dimensions of the device. Maintaining a uniform aspect ratio within a particular floor-plan is not always possible in the MIMO design because of the different submodule resource requirements. By varying area overhead and aspect ratio, nine different floor-plans are applied to the MIMO design.

10% Area Overhead MIMO Floor-plans

The first grouping of three floor-plans, shown in Figure 5.4, all have submodule area overheads of 10%. The first floor-plan with 10% area overhead has submodules with low (vertically-oriented) aspect ratios of 0.19 to 3 arranged from left to right in the order they occur in the pipeline. In this floor-plan, with 10% area overhead, the multiband correlator submodule reaches the vertical limits of the device, so it has a lower bound on its aspect ratio at 3. The second floor-plan with 10% area overhead has submodules with moderate aspect ratios of 1.11 to 2.02 arranged counterclockwise in a C layout in the order they occur in the pipeline. In this floor-plan, with 10% area overhead, the multiband correlator submodule reaches an upper bound on its aspect ratio at 1.11 without crossing the central clock column. The final floor-plan with 10% area overhead has submodules with high (horizontally-oriented) aspect ratios of 4.44 to .3 arranged from top to bottom in the order they occur in the pipeline. In this floor-plan, with 10% area overhead, the multiband correlator submodule reaches the horizontal limits of the device, so it has an upper bound on its aspect ratio at .

Figure 5.4: PlanAhead views of three MIMO floor-plans each with 10% area overhead. Panels: 0.19 to 3 Aspect Ratio; 1.11 to 2.02 Aspect Ratio; 4.44 to .3 Aspect Ratio.

25% Area Overhead MIMO Floor-plans

The second grouping of three floor-plans, shown in Figure 5.5, all have submodule area overheads of 25%. The submodules in this grouping all have arrangements similar to the previous grouping; however, the aspect ratios change slightly. The first floor-plan with 25% area overhead has submodules with low aspect ratios of 0 to . In this case, the multiband correlator submodule has a lower bound on its aspect ratio at . The second floor-plan with 25% area overhead has submodules with moderate aspect ratios of 0.98 to 2.02. Here, the multiband correlator submodule has an upper bound on its aspect ratio at . The final floor-plan with 25% area overhead has submodules with high aspect ratios of 3.9 to .46. In this floor-plan, the multiband correlator has an upper bound on its aspect ratio at .

50% Area Overhead MIMO Floor-plans

The third grouping of three floor-plans, shown in Figure 5.6, all have submodule area overheads of 50%. The submodules in this grouping all have arrangements similar to the previous grouping; however, the aspect ratios change slightly. The first floor-plan with 50% area overhead has submodules with low aspect ratios of 0.19 to 7. In this case, the multiband correlator submodule has a lower bound on its aspect ratio at 7. The second floor-plan with 50% area overhead has submodules with moderate aspect ratios of 0.82 to 2.02. Here, the multiband correlator submodule has an upper bound on its aspect ratio at . The final floor-plan with 50% area overhead has submodules with high aspect ratios of 3.20 to .48. In this floor-plan, the multiband correlator has an upper bound on its aspect ratio at .

5.3.2 MIMO Floor-plan Implementation Results

Each of the MIMO floor-planned designs described in the previous section, as well as one without a floor-plan, was implemented over 100 placement seeds for each of the 4.8, 4.9, and 5.0 ns clock period constraints. The percentages of implementations that met each clock period constraint are listed in Table 5.5. The MIMO design without a floor-plan is able to meet a 4.8 ns clock period constraint in 12% of implementations, which is the tightest clock constraint applied to all the designs.
The 4.9 and 5.0 ns clock period constraints provide additional resolution in the results.

In the case of the MIMO design, enforcing a floor-plan on the design is not clearly better than implementing it without a floor-plan. For most combinations of clock constraint, aspect ratio, and area overhead, the design without a floor-plan meets timing more often than with a floor-plan. This shows that floor-planning is not the end-all solution for meeting timing. Even with a floor-plan, there is still a great deal of randomness in placement. With only two designs considered in this chapter, it is difficult to know how often floor-plans help or hinder design timing. Also, what is not shown by Table 5.5 is total development time. Xilinx suggests floor-planning in conjunction with a hierarchical approach to implementation for minimizing development cycles spent meeting timing. Such an analysis is beyond the scope of this work, but would be a natural direction for future work.

Figure 5.5: PlanAhead views of three MIMO floor-plans each with 25% area overhead. Panels: 0 to Aspect Ratio; 0.98 to 2.02 Aspect Ratio; 3.9 to .46 Aspect Ratio.

Figure 5.6: PlanAhead views of three MIMO floor-plans each with 50% area overhead. Panels: 0.19 to 7 Aspect Ratio; 0.82 to 2.02 Aspect Ratio; 3.20 to .48 Aspect Ratio.

Table 5.5: Percentage of MIMO implementations meeting a specified minimum clock period.
No Floor-plan
Aspect Ratio Met 4.8 ns Met 4.9 ns Met 5.0 ns
N/A 12% 49% 80%
10% Area Overhead Floor-plans
Aspect Ratio Met 4.8 ns Met 4.9 ns Met 5.0 ns
0.19 to 3 2% 11% 46%
1.11 to % 3% 63%
4.44 to .3 % 2% 61%
25% Area Overhead Floor-plans
Aspect Ratio Met 4.8 ns Met 4.9 ns Met 5.0 ns
0 to 4% 18% 1%
0.98 to % 38% 73%
3.9 to .46 16% 42% 73%
50% Area Overhead Floor-plans
Aspect Ratio Met 4.8 ns Met 4.9 ns Met 5.0 ns
0.19 to 7 7% 18% 0%
0.82 to % 40% 69%
3.20 to .48 14% 36% 80%

Regardless of the comparison to the design without a floor-plan, the effect of submodule aspect ratio and area overhead upon design timing can still be compared among the floor-plans. At 10% area overhead, significantly more implementations meet timing with the moderate aspect ratio floor-plan than with either of the extreme aspect ratio floor-plans. This corresponds to the independent submodule findings where, at or below 20% area overhead, aspect ratio has a significant impact on the percentage of submodule implementations meeting timing, with extreme aspect ratios often having the lowest percentages. The high aspect ratios tend to have more implementations meeting timing than the approximately inverse low aspect ratios. Similar to the FP-Pipe design, this is likely because the high aspect ratios are nearer to 2.0, which minimizes intra-submodule wire delays.

At 25% and 50% area overhead, the results are different. The percentages of implementations meeting timing are increased above the 10% area overhead results, and the moderate aspect ratio floor-plan no longer dominates the extreme aspect ratio floor-plans. This also corresponds to the independent submodule findings where, above 20% area overhead, aspect ratio no longer has a significant impact on the percentage of submodule implementations meeting timing. Above 20% area overhead, there are many cases where the high aspect ratio floor-plans have a higher percentage of implementations meeting timing than the moderate aspect ratio floor-plans. The low aspect ratio floor-plans still have the lowest number of implementations meeting timing. This again is likely because the low aspect ratio floor-plans are furthest from the 2.0 aspect ratio which minimizes intra-submodule wire delays. The low aspect ratio floor-plans are also able to maintain a more extreme aspect ratio than the high aspect ratio floor-plans when the area overhead is increased, because the device naturally has a greater tile height than width.

5.4 Conclusions

Overall, the implementation results of both the FP-Pipe and MIMO designs support the conclusions drawn from the independent submodule implementations in Chapter 4. In particular, several questions are answered.

Can a floor-planned design meet timing with any combination of submodule aspect ratio and area overhead? Yes.
While only a subset of aspect ratio and area overhead combinations is selected for floor-plans in the experiments, both moderate and extreme aspect ratios are tested, and all of the floor-plans are able to meet the tightest timing constraint achievable for the designs. This supports the general results of the independent submodule implementations.

Is a moderate aspect ratio preferred for submodules that comprise the floor-plan? Yes. When the area overhead is low (10% to 16% in the designs tested), aspect ratio has a significant impact on the percentage of design implementations that meet timing. In these cases, the moderate aspect ratios (1.11 to 2.02 in the designs tested) have the greatest percentage of implementations that meet timing. This supports the results of the independent submodule implementations, where an aspect ratio near 2.0 is suggested as ideal.

Is an area overhead greater than 20% preferred for submodules that comprise the floor-plan? Yes. Floor-planned designs with submodule area overhead above 20% have a greater percentage of implementations meeting timing than those below 20%. This supports the results of the independent submodule implementations.
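The percentages discussed in this chapter are, in effect, the fraction of implementation runs whose minimum clock period is at or below a target. The following is a minimal sketch of that tally, not the thesis tooling; the function name `percent_meeting` and the ten run values are hypothetical and chosen only for illustration.

```python
def percent_meeting(min_periods_ns, targets_ns):
    """Return {target: percentage of runs whose minimum clock period <= target}."""
    if not min_periods_ns:
        raise ValueError("need at least one implementation result")
    total = len(min_periods_ns)
    return {
        t: 100.0 * sum(1 for p in min_periods_ns if p <= t) / total
        for t in targets_ns
    }


# Hypothetical minimum clock periods (ns) from ten implementation runs
# of one floor-plan; these are NOT measurements from the thesis.
runs = [4.75, 4.85, 4.88, 4.93, 4.97, 4.99, 5.02, 5.10, 4.79, 4.95]
result = percent_meeting(runs, [4.8, 4.9, 5.0])
print(result)
```

Because each run depends on the placer's random seed, a percentage of this kind is only meaningful over many runs per floor-plan.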

CHAPTER 6. OTHER SUBMODULE AREA CONSTRAINT EFFECTS

While meeting a specified minimum clock period is often the primary concern when floor-planning, there are other interesting ways that area constraints affect submodule implementations. This chapter examines how area constraints impact submodule routing overflow, submodule relocation, and implementation runtimes using the same submodules introduced in an earlier chapter.

Submodule Routing Overflow

While it is possible to use area constraints to restrict the placement of submodules to specific regions of the FPGA fabric during MAP, it is not reasonable to restrict all of the routing during PAR. PAR only respects directed routing constraints, where each directed routing constraint maps a specific net to specific routing resources [17]. Therefore, PAR in the conventional EDA flow almost invariably generates submodule implementations in which intra-submodule routes overstep the area constraint boundaries. This routing behavior is termed routing overflow and is shown in Figure 6.1.

In conventional implementation flows, submodule routing overflow is not a concern. However, in partial reconfiguration (PR) or incremental flows, routing overflow can become troublesome. In a PR flow, all intra-submodule routing must be contained within the area constraint which corresponds to the partially reconfigurable partition of the design, because areas outside of that region will not be reconfigured at runtime. In a hierarchical incremental partition flow, submodules may be imported from previous implementations. While two imported submodules may be placed near each other if they have non-overlapping area constraints, they often conflict over routing resources due to routing overflow. In the incremental HMFlow [22], where submodules are relocatable and placed post-implementation, routing overflow increases the effective submodule area and limits device utilization. In these situations, it is helpful to minimize the amount of routing overflow.
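The conflict described above for imported submodules can be illustrated with a small sketch. It assumes, purely for illustration, that each submodule is summarized by an inclusive rectangle of tile coordinates and by the set of tiles its routes touch; the helper names and coordinates are hypothetical, and tile-level sharing is only a coarse proxy for true conflicts, which occur at the level of individual routing resources.

```python
def rects_overlap(a, b):
    """a and b are (x_min, y_min, x_max, y_max) in inclusive tile coordinates."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])


def overflow_conflicts(tiles_a, tiles_b):
    """Return the tiles used by the routing of both submodules."""
    return set(tiles_a) & set(tiles_b)


# Hypothetical area constraints that do not overlap.
constraint_a = (0, 0, 3, 3)
constraint_b = (5, 0, 8, 3)

# Hypothetical routed tiles; (4, 1) lies outside both constraints,
# so it is routing overflow for both submodules.
routes_a = {(2, 1), (3, 1), (4, 1)}
routes_b = {(4, 1), (5, 1), (6, 1)}

disjoint = not rects_overlap(constraint_a, constraint_b)
conflicts = overflow_conflicts(routes_a, routes_b)
print(disjoint, conflicts)
```

In this example the area constraints are disjoint, yet both submodules route through the same tile between them, which is the situation that makes placing imported submodules side by side difficult.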


Figure 6.1: A RapidSmith view of a Picoblaze microcontroller submodule implemented on the xc5vsx240t device. The tiles with utilized device resources are shaded. The thick white line indicates the area constraint applied to the submodule during implementation. The shaded interconnect tiles outside of the area constraint are from routes which overflow outside of the area constraint.

Submodule Routing Overflow Results

The data for routing overflow is available from the same set of experiments presented in Chapter 4. Only intra-submodule routes are considered when determining routing overflow. Routes to or from the HM barriers are not included. Routing overflow is calculated as a percentage in Equation 6.1:

\[
\text{Routing Overflow \%} = \left( \frac{\text{Total Utilized Tiles Outside Area Constraint}}{\text{Total Tiles Within Area Constraint}} \right) \times 100 \tag{6.1}
\]

This measure indicates how much greater the effective area of the submodule is than the area it is constrained to.¹ In Figure 6.1, the area within the area constraint is 60 tiles, and there are

¹ A finer granularity metric such as the percentage of Programmable Interconnect Points (PIPs) utilized outside the area constraint may be more appropriate for most applications, but the routing overflow data was generated in the context of HMFlow [22], where the coarse tile granularity is the preferred metric of comparison for placement purposes.
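As a worked illustration of Equation 6.1, the sketch below counts utilized tiles that fall outside a rectangular area constraint and converts the count to a percentage. The 60-tile constraint size matches the Figure 6.1 example; the rectangle dimensions, the tile coordinates, and the function names are hypothetical and are not taken from the thesis data.

```python
def count_tiles_outside(used_tiles, x_min, y_min, x_max, y_max):
    """Count utilized tiles outside an inclusive rectangular area constraint."""
    return sum(
        1
        for (x, y) in set(used_tiles)
        if not (x_min <= x <= x_max and y_min <= y <= y_max)
    )


def routing_overflow_percent(tiles_outside, tiles_within):
    """Equation 6.1: utilized tiles outside the constraint, as a percentage
    of the total tiles within the constraint."""
    if tiles_within <= 0:
        raise ValueError("area constraint must contain at least one tile")
    return 100.0 * tiles_outside / tiles_within


# Hypothetical 6 x 10 constraint (60 tiles) with inclusive corner coordinates.
x_min, y_min, x_max, y_max = 0, 0, 5, 9
tiles_within = (x_max - x_min + 1) * (y_max - y_min + 1)

# Hypothetical utilized tiles: two inside the constraint, fifteen outside it.
used = [(1, 1), (2, 2)] + [(6, y) for y in range(10)] + [(x, 10) for x in range(5)]

outside = count_tiles_outside(used, x_min, y_min, x_max, y_max)
overflow = routing_overflow_percent(outside, tiles_within)
print(tiles_within, outside, overflow)
```

With these made-up inputs, 15 utilized tiles lie outside a 60-tile constraint, so the effective submodule area is 25% larger than the constrained area.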
