Exploring Architecture Parameters for Dual-Output LUT based FPGAs

Zhenghong Jiang, Colin Yu Lin, Liqun Yang, Fei Wang and Haigang Yang
System on Programmable Chip Research Department, Institute of Electronics, Chinese Academy of Sciences, Beijing, China
The University of Chinese Academy of Sciences, Beijing, China
Corresponding author: yanghg@mail.ie.ac.cn

Contents: 1. Motivation 2. Preliminaries 3. Experimental Results 4. Conclusion

Motivation
- Exploring architecture parameters has always been a key part of FPGA research: look-up table (LUT) size, cluster size, number of cluster inputs, etc.
- Dual-output LUTs have replaced traditional single-output LUTs as the mainstream solution in FPGA chips: Stratix II, III, ...; Virtex 5, 6, ...
- However, no published research explores design parameters for dual-output LUT based architectures.
- Although not innovative, a careful exploration of the design parameters for this new FPGA architecture is still necessary and helpful.

Parameters to Explore
Traditional parameters:
- Look-up table size (K)
- Number of inputs to a logic cluster (I)
- Size of the logic cluster (N)
New parameters for the dual-output structure:
- Ratio of shared inputs between the two sub-LUTs (R)
- Number of inputs of a dual-output LUT
- Largest unfractured LUT size

CAD Flow
Flow stages (from the diagram): architecture parameter description and benchmark circuits; technology mapping; packing (1. merge small LUTs, 2. pack LUTs and registers); placement; routing to find the minimum channel width; re-routing with 1.3X the minimum channel width; area and delay results.
- Use the VTR benchmarks as the evaluation objects.
- Use Berkeley's ABC to complete the technology mapping.
- Use VPR to do the physical synthesis.
- Several iterations are run to obtain the final delay and area results.
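
The routing step of the flow (find the minimum channel width, then re-route at 1.3X that width) can be sketched as follows. This is only an illustration: `routes_at` is a hypothetical stand-in for a real router call, the binary search assumes that routability is monotonic in channel width, and rounding the 1.3X width up to an integer is also an assumption.

```python
from typing import Callable


def minimum_channel_width(routes_at: Callable[[int], bool], upper: int = 512) -> int:
    """Binary-search the smallest channel width at which routing succeeds.

    Assumes monotonic routability: if width w routes, every width above w routes.
    """
    if not routes_at(upper):
        raise ValueError("design does not route even at the upper bound")
    low, high = 1, upper
    while low < high:
        mid = (low + high) // 2
        if routes_at(mid):
            high = mid       # mid routes, so the minimum is mid or smaller
        else:
            low = mid + 1    # mid fails, so the minimum is larger than mid
    return low


def relaxed_channel_width(min_width: int) -> int:
    """Return 1.3X the minimum width, rounded up, using integer arithmetic."""
    return (13 * min_width + 9) // 10


# Hypothetical router stand-in: pretend the design routes at width 57 or more.
def fake_router(width: int) -> bool:
    return width >= 57


w_min = minimum_channel_width(fake_router)
w_final = relaxed_channel_width(w_min)
print(w_min, w_final)
```

Re-routing at a relaxed width rather than at the minimum gives delay figures for a less congested routing fabric.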

Area and Delay Models 1: Model for LUTs
Single-output structure:
- Use a full-custom design based on a commercial 40 nm technology.
- $T_{single}$: delay of a single LUT; $A_{single}$: area of a single LUT.
Dual-output structure:
- Since the value of R varies, it is impossible to implement every variant by hand.
- Use a model derived from the single-output LUT: $T_{dual} = T_{single}$, $A_{dual} = A_{single} + A_{mux} \times Number_{mux}$

Area and Delay Models 2: Model for Crossbar
- Besides the LUTs, the input crossbar in a logic cluster also accounts for a large share of area and delay.
- A semi-analytical method is used: crossbars are implemented for several input counts, and the resulting data are fitted to equations that estimate delay and area.
- Fitted delay: $T_X = 0.28\, I_X^{0.88} + 109.9$
- Fitted area: $A_X$ is expressed in terms of $N$, $J$, $I_X$, $A_{SRAM}$ and $A_{buf}$.

Area and Delay Models 3
Flip-flop and output multiplexer:
- There is no difference between the single-output and dual-output LUT based FPGA architectures.
- Use a full-custom design to obtain delay and area.
Routing architecture:
- Use the classical island-style routing architecture with unidirectional, single-driver wire segments.
- Parameters are extracted from our own implementation.

Experimental Groups by R
- The two figures on the right show the two extreme values of R (R = 0 and R = 1).
- The specific range of R differs with LUT size.
- When the LUT size is large, it becomes hard to evaluate all possible values of R.
- Thus, we choose four representative values of R (0, 1/3, 2/3 and 1); their results capture the essential impact of different numbers of shared inputs.

Number of Inputs to a Logic Cluster
- The number of inputs to a logic cluster is a very important parameter in FPGA exploration, since it directly determines the logic utilization, area and delay of the logic cluster.
- Using the lowest value of I that provides 98% of the maximum cluster utilization is appropriate*.
- For single-output: I = K(N+1)/2.
- For dual-output, we study the best I in relation to K and N under different R (figure panels: R = 0, R = 1/3, R = 2/3, R = 1).
- From the figures, the values of I still show a linear relationship with K and N; only the specific values differ with R.
* E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," IEEE Trans. on VLSI Systems, 12(3), 288-298, 2004.
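
The selection rule for I can be sketched in a few lines. The single-output formula I = K(N+1)/2 comes from the slide; the utilization numbers below are hypothetical placeholders, not measured results, and the function names are illustrative only.

```python
from typing import Dict


def single_output_cluster_inputs(k: int, n: int) -> float:
    """Cluster inputs for the single-output case: I = K(N+1)/2."""
    return k * (n + 1) / 2


def lowest_sufficient_inputs(utilization_by_inputs: Dict[int, float],
                             threshold: float = 0.98) -> int:
    """Return the lowest I whose utilization reaches `threshold` of the maximum."""
    best = max(utilization_by_inputs.values())
    return min(i for i, u in utilization_by_inputs.items()
               if u >= threshold * best)


# Hypothetical utilization sweep for one (K, N) point; values are made up.
sweep = {12: 0.80, 16: 0.93, 20: 0.975, 22: 0.99, 26: 1.00}

print(single_output_cluster_inputs(4, 10))  # formula value for K=4, N=10
print(lowest_sufficient_inputs(sweep))      # lowest I meeting the 98% rule
```

For the dual-output case the same selection rule would be applied once per value of R, giving one best I per (K, N, R) point.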

Estimation of Area
- Chip area is always an important metric in FPGA architecture design.
- In our experiments, area is measured as the total number of minimum-width transistors required to implement both the logic and the routing resources.

Area versus R
The figure on the right illustrates the best ratio of shared inputs at each combination of K and N:
- No single ratio of shared inputs gives the best area result for all architectures.
- The best value of R for minimum area increases as K and N grow.
- Except for some small values of K and N, R = 2/3 is the best choice; it wins in 78% of all cases.
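
One way to produce this kind of "best R per (K, N)" statistic is to pick the R with the minimum area at each point and tally the winners. The sketch below assumes that procedure; the area values are hypothetical and chosen only to make the example run.

```python
from collections import Counter
from fractions import Fraction

# The four representative ratios of shared inputs explored in the slides.
R_VALUES = [Fraction(0), Fraction(1, 3), Fraction(2, 3), Fraction(1)]


def best_r(metric_by_r):
    """Return the R whose metric (here: area) is smallest."""
    return min(metric_by_r, key=metric_by_r.get)


def share_of_best_r(metric):
    """Fraction of (K, N) points at which each R gives the best metric."""
    wins = Counter(best_r(by_r) for by_r in metric.values())
    total = len(metric)
    return {r: Fraction(wins[r], total) for r in R_VALUES}


# Hypothetical normalized areas, keyed by (K, N) and then by R.
area = {
    (4, 4):  dict(zip(R_VALUES, [1.00, 0.96, 0.98, 1.05])),
    (5, 8):  dict(zip(R_VALUES, [1.10, 1.02, 0.99, 1.04])),
    (6, 10): dict(zip(R_VALUES, [1.30, 1.18, 1.12, 1.15])),
    (7, 12): dict(zip(R_VALUES, [1.60, 1.45, 1.36, 1.38])),
}

shares = share_of_best_r(area)
for r in R_VALUES:
    print(f"R = {r}: best at {float(shares[r]):.0%} of points")
```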

Area Under Best R
- The figure shows the area for different K and N when the best value of R is chosen.
- A small LUT size with a large cluster size is the preferred combination for area efficiency.

Area Breakdown
To better understand what differs between dual-output and single-output LUTs, we break the total area down into logic and routing parts. Several observations can be made:
- Routing area becomes less important as LUT size grows. When the LUT size reaches 6, logic area becomes the dominant part of the total area.
- The dual-output architecture gains most of its benefit from the logic part, with only a small improvement in routing area.
- The dual-output architecture gives better area efficiency at large LUT sizes.
Figure legend: dual-output logic, black solid; dual-output routing, black dashed; single-output logic, red solid; single-output routing, red dashed.

Delay versus R
The figure shows the distribution of the best choice of R under different K and N:
- Unlike area, no clear pattern for the best value of R appears in the figure, which indicates that performance is only weakly correlated with the ratio of shared inputs.
- Although the statistics give a distribution of the best R, the delay deviations among the four values of R are all less than 10%, which indicates that R has little impact on delay.

Delay Under Best R
As for area, the figure on the right illustrates the trend of delay across combinations of K and N when the best R is chosen at each point:
- Performance requires large LUT and cluster sizes, a different relation from the one seen for area.

Delay Comparison
Figure legend: dual-output, black; single-output, red.
Observations:
- The difference between the dual-output and the traditional single-output architecture is not pronounced.
- Statistically, the deviations in total delay between the two architectures are less than 6.5%.
- Therefore, unlike area efficiency, performance is not noticeably improved by the dual-output architecture.

Area-Delay-Product versus R
In the previous discussion, area and delay show two opposite tendencies:
- Area efficiency prefers a small LUT size.
- Performance requires a LUT size as large as possible.
The area-delay product is a way to make a trade-off between area and performance.
The figure illustrates the best R at different combinations of K and N:
- The difference in delay between the four values of R is small; as a result, area is the dominant factor in the area-delay product.
- The distribution of best R for the area-delay product is similar to the distribution for area.
- However, the delay metric still plays a role in the choice of R, so the portion of R = 1/3 increases due to delay benefits.
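
A small worked example can make the last point concrete: when the delay spread across R is small, the area-delay product mostly follows area, yet a modest delay advantage can still move the winner from R = 2/3 to R = 1/3. The numbers below are hypothetical and chosen only to show that effect.

```python
def area_delay_product(area: float, delay: float) -> float:
    """Area-delay product: lower is better."""
    return area * delay


# Hypothetical (normalized area, delay) pairs for one (K, N) point.
points = {
    "0":   (1.10, 10.0),
    "1/3": (1.00, 9.6),
    "2/3": (0.98, 10.0),
    "1":   (1.03, 10.2),
}

# Winner by area alone, and winner by area-delay product.
best_by_area = min(points, key=lambda r: points[r][0])
best_by_adp = min(points, key=lambda r: area_delay_product(*points[r]))

print("best R by area:", best_by_area)
print("best R by area-delay product:", best_by_adp)
```

Here R = 2/3 has the smallest area, but the slightly lower delay at R = 1/3 gives it the smallest product.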

Area-Delay-Product Comparison
Figure legend: dual-output, black; single-output, red.
- The advantage of the dual-output architecture becomes more obvious as LUT size grows.
- The best area-delay product occurs at a LUT size of 5 to 6, which is larger than the traditional result of 4.

Conclusion
Summary of the best parameters for different design goals:

| Criteria | Single-output K | Single-output N | Dual-output K | Dual-output N | Dual-output R |
| Area | 4 | 7 to 15 | 4 to 5 | 7 to 11 | 1/3 to 2/3 |
| Delay | 9 | 7 to 15 | 9 | 6 to 15 | 0 to 1/3 |
| Area-Delay | 4 | 11 to 15 | 5 | 8 to 12 | 1/3 to 2/3 |

- No single ratio of shared inputs is advantageous for every LUT and cluster size.
- Dual-output is an efficient way to reduce area cost at large LUT sizes, but it is not a good way to improve performance.

Suggestions and discussions are welcome by email: yanghg@mail.ie.ac.cn Thank you!