ISPD 2017 Contest Clock-Aware FPGA Placement

1 ISPD 2017 Contest Clock-Aware FPGA Placement Stephen Yang, Chandra Mulpuri, Sainath Reddy, Meghraj Kalase, Srinivasan Dasasathyan, Mehrdad E. Dehkordi, Marvin Tom, Rajat Aggarwal

2 Acknowledgement Xilinx Vivado Management Team Support from Dr. Sudip Nag and Dr. Salil Raje Support from Xilinx Lab

3 Outline
- Background
- Top-5 Team Presentations
- Benchmarking Results
- Award Ceremony

4 Last Year: Routability-Driven FPGA Placement
- First FPGA-related contest
- Latest FPGA architecture
- Vivado: industrial flow for evaluation
- Academic benchmark format: bookshelf
- Focus: FPGA legalization rules and routing congestion

5 This Year: Clock-Aware FPGA Placement
- Continuous effort on the FPGA placement problem
- Clock legalization: key constraint in FPGA placement
- Wirelength as the primary metric
- Reduced difficulty on routability, reduced runtime factor

6 Contest Timelines
- Oct 2016: Problem definition and contest planning
- Nov 2016: Contest announcement
- Dec 12, 2016: Sample benchmarks ready
- Jan 15, 2017: Registration deadline
- Feb 3, 2017: Evaluation flow ready
- Feb 15, 2017: Alpha submission
- Mar 9, 2017: Final submission
- Mar 10-12, 2017: Benchmarking
- Mar 22, 2017: Announce winners at ISPD

7 Registration: 13 Teams
Team          Affiliation                                         Region
VDAplacer     National Chiao Tung University                      Asia
UTPlaceF2.0   University of Texas at Austin                       North America
WicilPlacer   University of Wisconsin-Madison                     North America
RippleFPGA    Chinese University of Hong Kong                     Asia
Uni-Placer    Ulsan National Institute of Science and Technology  Asia
CECA_Placer   Peking University                                   Asia
NTUfplace     National Taiwan University                          Asia
GPlace        University of Guelph                                North America
BMTIplacer    Beijing Microelectronics and Technology Institute   Asia
AggiePlace    Texas A&M University                                North America
UFRGSPlace    Universidade Federal do Rio Grande do Sul           South America
POCA Tool     Politecnico di Torino, Torino, Italy                Europe
Kapees        Indian Institute of Technology, Guwahati            Asia

8 Final Submission: 9 Teams
Team          Affiliation                                         Region
VDAplacer     National Chiao Tung University                      Asia
UTPlaceF2.0   University of Texas at Austin                       North America
WicilPlacer   University of Wisconsin-Madison                     North America
RippleFPGA    Chinese University of Hong Kong                     Asia
CECA_Placer   Peking University                                   Asia
NTUfplace     National Taiwan University                          Asia
GPlace        University of Guelph                                North America
BMTIplacer    Beijing Microelectronics and Technology Institute   Asia
UFRGSPlace    Universidade Federal do Rio Grande do Sul           South America
Congratulations!

9 Target FPGA: Xilinx UltraScale VU095
- 20nm technology
- 1.2M logic cells

10 Clock Routing Architecture

11 Clock Region Rule
- At most 24 distinct clocks per clock region

12 Half Column Rule
- At most 12 distinct clocks per half column

13 (Hidden) Benchmark Statistics
Design     #LUTs        #FFs         #BRAMs       #DSPs       #I/O   #Clocks
Design1    215K (40%)   236K (22%)   170 (10%)    75 (10%)
Design2    215K (40%)   236K (22%)   170 (10%)    75 (10%)
Design3    242K (45%)   270K (25%)   255 (15%)    112 (15%)
Design4    268K (50%)   300K (28%)   340 (20%)    150 (20%)
Design5    295K (55%)   325K (30%)   425 (25%)    187 (25%)
Design6    322K (60%)   354K (33%)   510 (30%)    225 (30%)
Design7    350K (65%)   384K (36%)   595 (35%)    262 (35%)
Design8    376K (70%)   414K (38%)   680 (40%)    300 (40%)
Design9    392K (73%)   431K (40%)   765 (45%)    337 (45%)
Design10   408K (76%)   449K (42%)   850 (50%)    375 (50%)
Design11   424K (79%)   450K (43%)   900 (53%)    397 (53%)
Design12   440K (82%)   484K (45%)   950 (56%)    420 (56%)
Design13   456K (85%)   503K (47%)   1000 (59%)   442 (59%)
Largest: 1.0M instances, 57 clocks

14 Placer Evaluation Flow
- Contest placer: Design (bookshelf) -> .pl file
- Vivado: Design (Xilinx DB) -> Load Design -> Read Placement (.pl file) -> Clock and Legality Check -> Routing -> Routed WL

15 Evaluation Metrics and Ranking
- Score = Routed-WL * (1 + Runtime_Factor)
- Runtime Factor: 20% runtime -> 1% QoR, bounded by +/- 2.5%
- Failures: Routing-Failures > Legalization-Failures > Placer-Failures
- Ranking per design: 1, 2, 3, ..., n
- Final ranking: sum of the rankings of each team
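
The scoring rule on slide 15 can be sketched as follows. This is only an illustration: the slide does not say which runtime the 20% is measured against, so `reference_runtime` (for example the median runtime across teams) is an assumption, and all function names are hypothetical. Failure handling is left out.

```python
def runtime_factor(runtime, reference_runtime, bound=0.025):
    """Convert a relative runtime difference into a QoR adjustment.

    Assumption: runtime is compared against a reference runtime that the
    slide does not name. 20% extra runtime costs 1% QoR, so the slope is
    0.01 / 0.20 = 0.05; the result is clamped to +/- 2.5%.
    """
    relative = (runtime - reference_runtime) / reference_runtime
    return max(-bound, min(bound, 0.05 * relative))


def score(routed_wl, runtime, reference_runtime):
    """Score = Routed-WL * (1 + Runtime_Factor); lower is better."""
    return routed_wl * (1.0 + runtime_factor(runtime, reference_runtime))


def rank_sum(scores_by_design):
    """Sum per-design ranks (1 = best) for every team.

    scores_by_design: {design: {team: score}}.
    """
    totals = {}
    for per_team in scores_by_design.values():
        ordered = sorted(per_team, key=per_team.get)
        for rank, team in enumerate(ordered, start=1):
            totals[team] = totals.get(team, 0) + rank
    return totals
```

A placer that is 20% slower than the reference pays a 1% penalty on its routed wirelength, and no runtime difference can move the score by more than 2.5% in either direction.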

16 Top-5 Team Presentation

17 Top-5 Teams (In Alphabetical Order) GPlace, University of Guelph, Ziad Abuowaimer NTUfplace, National Taiwan University, Yun-Chih Kuo RippleFPGA, Chinese University of Hong Kong, Gengjie Chen UTPlaceF2.0, University of Texas, Austin, Wuxi Li VDAplacer, National Chiao Tung University, Chen Chen

19 GPlace 2.0: Clock-Aware Placement Tool for UltraScale FPGAs
Ziad Abuowaimer, Shawki Areibi, Anthony Vannelli, Gary Grewal
University of Guelph, March 22, 2017

20 GPlace 2.0 flow
1. Preplacement
2. Global Placement (WL-Driven): Star+ Solver, Site & Clock Legalization
3. Congestion Estimation: Adjust Global Routing Grid, NCTU-gr 2.0, LUT inflation
4. Clock-Signals Partitioning: Clock-Loads Center of Gravity, Bbox of Center of Gravity, Clock-Loads Assignment
5. Global Placement (Congestion-Driven): Star+ Solver, Site & Clock Legalization
6. Check: Overlap Bbox of Clock Signals <= 24? (NO: loop back; YES: write placement.pl)

21 Preplacement
Pin-Propagation Preplacement (similar to GPlace 1.0)

23 Global Placement (WL-Driven): Star+ Solver
Analytical Placement (Star+ and Jacobi). [Equations not recoverable from the source.]

24 Global Placement (WL-Driven): FF Legalization (objective is WL minimization)
Use bipartition legalization in three levels:
1. Clock-Region Bipartition: partition the FPGA into clock regions and recursively bipartition FFs into those clock regions.
2. Half-Column Bipartition: partition each clock region into half-columns and recursively bipartition FFs into those half-columns.
3. Site Bipartition: partition each half-column into sites and recursively bipartition FFs into those sites.
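
One level of the recursive bipartitioning described on slide 24 might look like the sketch below. It is a simplified one-dimensional illustration, not GPlace's implementation: bins are assumed to be ordered along a single axis, only site capacity is modeled (clock and control-set capacity are ignored), and the function and field names are hypothetical.

```python
def bipartition(cells, bins):
    """Recursively split cells among bins ordered along one axis.

    cells: list of (cell_id, x) pairs.
    bins:  list of (name, lower_bound, capacity), sorted by lower_bound.
    Returns {bin_name: [cell_id, ...]}.
    """
    total_cap = sum(b[2] for b in bins)
    if len(cells) > total_cap:
        raise ValueError("not enough capacity for these cells")
    if len(bins) == 1:
        return {bins[0][0]: [cid for cid, _ in cells]}

    mid = len(bins) // 2
    left_bins, right_bins = bins[:mid], bins[mid:]
    left_cap = sum(b[2] for b in left_bins)
    right_cap = total_cap - left_cap

    ordered = sorted(cells, key=lambda c: c[1])
    boundary = right_bins[0][1]
    # Preferred cut: cells left of the boundary stay on the left side.
    k = sum(1 for _, x in ordered if x < boundary)
    # Shift the cut only as far as the capacities require.
    k = max(min(k, left_cap), len(ordered) - right_cap)

    result = bipartition(ordered[:k], left_bins)
    result.update(bipartition(ordered[k:], right_bins))
    return result
```

The same routine could be applied three times: once with clock regions as bins, then with the half-columns of each clock region, then with the sites of each half-column.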

25 FF Legalization: Clock-Region Bipartition
Create a recursive bipartitioning tree data structure for the 40 clock regions. Each node in the tree contains:
- Site capacity
- Clock capacity

26 FF Legalization: Clock-Region Bipartition
- One tree structure maintains the site and control-set capacity constraints (levels: #Slices, #Groups, #Sub-groups, #FFs).
- One tree structure maintains the clock-signal capacity constraints.
[Figure: example trees not recoverable from the source.]

27 FF Legalization: Clock-Region Bipartition
FPGA-Clock-Region-Tree: a tree data structure that stores the number of clocks and the clock IDs at each node after FF legalization level 1.

28 FF Legalization: Half-Column Bipartition
Create a recursive bipartitioning tree data structure of the half-columns within each clock region. (Only 3 trees are actually needed, since there are 3 different patterns.) Each node in the tree contains:
- Site capacity
- Clock capacity

29 FF Legalization: Half-Column Bipartition
[Figure: clock capacity tree; site and control-set capacity tree.]

30 FF Legalization: Half-Column Bipartition
FPGA-Half-Column-Tree: a tree data structure that stores the number of clocks and the clock IDs at each node after FF legalization level 2.

31 FF Legalization: Site Bipartition
Create a recursive bipartitioning tree data structure of the sites within each half-column. Each node in the tree contains:
- Site capacity
[Figure: site and control-set capacity tree.]

32 Global Placement (WL-Driven): DSP Legalization (similar to FF legalization but without control-set constraints)
Use bipartition legalization in three levels:
1. Clock-Region Bipartition: partition the FPGA into clock regions and recursively bipartition DSPs into those clock regions. (Use and update FPGA-Clock-Region-Tree.)
2. Half-Column Bipartition: partition each clock region into half-columns and recursively bipartition DSPs into those half-columns. (Use and update FPGA-Half-Column-Tree.)
3. Site Bipartition: partition each half-column into sites and recursively bipartition DSPs into those sites.

33 Global Placement (WL-Driven): BRAM Legalization (similar to DSP legalization)
Use bipartition legalization in three levels:
1. Clock-Region Bipartition: partition the FPGA into clock regions and recursively bipartition BRAMs into those clock regions. (Use and update FPGA-Clock-Region-Tree.)
2. Half-Column Bipartition: partition each clock region into half-columns and recursively bipartition BRAMs into those half-columns. (Use and update FPGA-Half-Column-Tree.)
3. Site Bipartition: partition each half-column into sites and recursively bipartition BRAMs into those sites.

34 Congestion Estimation
- Adjust the global routing grid capacity.
- Run the NCTU-gr 2.0 global router to get the congestion estimation.
- Inflate LUTs based on both the number of pins and the congestion value; the ratio is based on the congestion value. [Inflation formula not recoverable from the source.]

35 Clock-Signals Partitioning
Steps: Clock-Loads Center of Gravity, Bbox of Center of Gravity, Clock-Loads Assignment.

36 Clock-Loads Center of Gravity
Calculate the center of gravity for each clock signal based on the positions of its clock loads. (Ignore the two global clock signals ControlSig0 and ControlSig1.)

37 Bbox of Center of Gravity
Find a bounding box that contains all center-of-gravity points.

38 Clock-Loads Assignment
Assign each clock's loads to the closest corner, based on the distance from the clock's center of gravity to that corner. Limit each partition to at most 20 different clocks.

39 Clock-Loads Assignment
Place each partition in the corresponding FPGA corner. Place the inflated LUTs in the middle of the FPGA.
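
The clock-signals partitioning steps (center of gravity, bounding box, corner assignment) can be sketched as below. This is a minimal illustration under stated assumptions: the corners are taken to be those of the center-of-gravity bounding box, a clock whose nearest corner is already full falls back to the next-nearest corner, and clocks are processed closest-first. None of these tie-breaking details are given on the slides, and all names are hypothetical.

```python
import math


def partition_clocks(clock_loads, max_clocks=20,
                     ignore=("ControlSig0", "ControlSig1")):
    """Group clocks into four corner partitions.

    clock_loads: {clock_name: [(x, y), ...]} positions of each clock's loads.
    Returns {corner_name: [clock_name, ...]}.
    """
    # Step 1: center of gravity of every clock's loads.
    cog = {}
    for clk, loads in clock_loads.items():
        if clk in ignore or not loads:
            continue
        cog[clk] = (sum(x for x, _ in loads) / len(loads),
                    sum(y for _, y in loads) / len(loads))
    if not cog:
        return {}

    # Step 2: bounding box of all center-of-gravity points.
    xs = [p[0] for p in cog.values()]
    ys = [p[1] for p in cog.values()]
    corners = {
        "SW": (min(xs), min(ys)), "SE": (max(xs), min(ys)),
        "NW": (min(xs), max(ys)), "NE": (max(xs), max(ys)),
    }

    # Step 3: assign each clock to the nearest corner with room left.
    def nearest(clk):
        return min(math.dist(cog[clk], c) for c in corners.values())

    partitions = {name: [] for name in corners}
    for clk in sorted(cog, key=nearest):
        ranked = sorted(corners, key=lambda n: math.dist(cog[clk], corners[n]))
        for name in ranked:
            if len(partitions[name]) < max_clocks:
                partitions[name].append(clk)
                break
        else:
            raise ValueError("more clocks than 4 * max_clocks")
    return partitions
```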

40 Global Placement (Congestion-Driven)
Similar to Global Placement (WL-Driven) but with inflated LUTs.

45 NTUfplace Clock-Aware FPGA Placement Yun-Chih Kuo, Chau-Chin Huang, Shih-Chun Chen, Chun-Han Chiang, Yao-Wen Chang, and Sy-Yen Kuo Mar. 22, 2017 National Taiwan University 45

46 Outline Introduction Proposed Approach Experimental Results Demo 46

48 Analytical Placement Formulation
Given the chip region and block dimensions, determine (x, y) for all movable blocks:
  $\min W(x, y)$ (wirelength function) s.t. $D_b(x, y) \le M_b$ for every bin $b$
  $D_b$: density for bin $b$; $M_b$: max density for bin $b$; density $= A_{block} / A_{bin}$
Relax the constraints into the objective function (penalty):
  $\min W(x, y) + \lambda \sum_b \left( \max(D_b(x, y) - M_b, 0) \right)^2$
- Apply differentiable wirelength and density models
- Use the gradient method to solve the optimization problem
- Increase $\lambda$ gradually to meet density constraints
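
To make the penalty formulation concrete, here is a small sketch that evaluates the relaxed objective for a given placement. It is an illustration only: each block's whole area is charged to the bin containing its center (a real analytical placer would use a smooth density model instead), the wirelength is passed in as a number, and the function names are hypothetical.

```python
def bin_densities(blocks, bin_w, bin_h, nx, ny):
    """Density per bin = block area in the bin / bin area.

    blocks: list of (x, y, area); the area is charged to the bin that
    contains (x, y), which is a simplification for illustration.
    Returns densities in row-major order.
    """
    area = [[0.0] * nx for _ in range(ny)]
    for x, y, a in blocks:
        i = min(int(x // bin_w), nx - 1)
        j = min(int(y // bin_h), ny - 1)
        area[j][i] += a
    return [a / (bin_w * bin_h) for row in area for a in row]


def penalized_cost(wirelength, densities, max_densities, lam):
    """W + lambda * sum_b max(D_b - M_b, 0)^2."""
    overflow = sum(max(d - m, 0.0) ** 2
                   for d, m in zip(densities, max_densities))
    return wirelength + lam * overflow
```

Raising `lam` between solver passes makes any remaining overflow more expensive, which is how the density constraints are gradually enforced.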

49 Differentiable Wirelength and Density Models
- Log-sum-exp wirelength model [Naylor et al., 2001]: an effective smooth and differentiable function for HPWL approximation; this model achieves exact HPWL when $\gamma \to 0$
- Bell-shaped density model [Kahng et al., ICCAD 04] [Plot not recoverable from the source.]
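
The log-sum-exp model smooths the max and min in HPWL: for one net and one axis it uses $\gamma \ln \sum_k e^{x_k/\gamma} + \gamma \ln \sum_k e^{-x_k/\gamma}$ in place of $\max_k x_k - \min_k x_k$. The sketch below evaluates it for a single net; the max-shift inside `_lse` is a standard numerical-stability trick added here, not something stated on the slide, and the function names are hypothetical.

```python
import math


def _lse(values, gamma):
    """gamma * ln(sum(exp(v / gamma))), computed stably by shifting by max."""
    m = max(values)
    return m + gamma * math.log(sum(math.exp((v - m) / gamma) for v in values))


def lse_hpwl(pins, gamma):
    """Smooth upper bound on the half-perimeter wirelength of one net.

    pins: list of (x, y) pin positions.
    """
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    span_x = _lse(xs, gamma) + _lse([-x for x in xs], gamma)
    span_y = _lse(ys, gamma) + _lse([-y for y in ys], gamma)
    return span_x + span_y
```

For pins at (0, 0), (1, 2), and (3, 1) the exact HPWL is 5; the smooth value is never below that and tightens toward it as `gamma` shrinks.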

50 Multilevel Global Placement
- Cluster the blocks based on connectivity/size to reduce the problem size (clustering)
- Initial placement
- Iteratively decluster the clusters and further refine the placement (declustering & refinement)
[Figure legend: clustered block, chip boundary.]

52 Clock-Aware Multilevel Global Placement
- Cluster blocks with clock constraint
- Initial placement, then declustering & refinement
[Figure legend: clustered block, chip boundary, blocks within same clock domain.]

53 Mismatch between GP and LG
- The analytical model for global placement gives continuous solutions, while legalization pulls blocks to discrete and scattered legal locations
- Displacement of blocks is large
[Figure: I/O block, DSP, CLB, and RAM columns.]

54 Heterogeneous Cost Function
Therefore, we can solve this with the gradient method:
  $\min W(x, y) + \lambda_1 \sum_b \left( \max(D_b(x, y) - M_b, 0) \right)^2 + \lambda_2 G(x)$
$G(x)$: cost of the complex-block-alignment function. [Figure: smoothed cost over DSP columns.]

55 Clocking Resource Constraint
- We formulate the clocking resource constraint in clock regions as a cost in the placement stages
- Therefore, we can resolve the clocking resource constraint by moving blocks out of resource-lacking regions

57 Experimental Results
We ran our program on an Intel Xeon E CPU with 32GB memory.
[Table: Design, #nodes, #nets, Routed-WL, Runtime for the clk_design benchmarks; values not recoverable from the source.]

59 Demo 59

60 Thank You! 60

62 CUHK - RippleFPGA Gengjie Chen, Chak-Wa Pui, Evangeline F. Y. Young, Bei Yu March 22, 2017

63 Outline Background Our Flow How We Handle Clock Rules Clock region Half column

64 Background: Heterogeneous FPGA
[Figure: I/O, CLB, RAM, DSP, Switch Box.]

65 Background: Configurable Logic Block (CLB)
- A CLB holds Basic Logic Elements BLE 0 to BLE 7, with LUT 0 to LUT 15 and FF 0 to FF 15 (two LUTs and two FFs per BLE)
- Upper half (BLE 0 to BLE 3) uses CK0, SR0, CE0/1
- Lower half (BLE 4 to BLE 7) uses CK1, SR1, CE2/3

67 Flows in Previous Work
- Conventional flow (pack-place): packing turns the flat netlist (LUT/FF) into BLEs and CLBs, then placement produces the placed design
- Packing based on physical information (place-pack-place): Un/DoPack [ICCAD 06], HDPack [FPL 07], UTPlaceF [ICCAD 16], GPlace-pack [ICCAD 16]
- Flat placement followed by legalization (place-pack): GPlace-flat [ICCAD 16]

68 Our Flow
flat netlist -> flat GP -> soft BLE packing -> BLE GP -> CLB physical packing (LG) -> two-level DP -> slot assignment in CLB -> placed design

69 Our Flow
Features:
- Stair-step flow which interleaves packing and placement
- Implicit CLB packing similar to ASIC LG (Tetris)
Strengths:
- Feedback quickly
- Iteratively improve other metrics (congestion, timing, power, etc.)
- Approximate analytical GP directly
- Smoothly control packing density
- Easily embed other metrics
- Easily consider some constraints (e.g., clock rules)

71 Clock Rules
- Clock region: ~32x60 sites => global. A clock occupies a clock region if its bounding box (BB) does. <= 24 clocks in each.
- Half column: 2x30 sites => local. <= 12 clocks in each.
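
The two clock rules can be checked with a few lines of code. The sketch below is an illustration under assumptions: the caller supplies `region_of` and `half_column_of` functions that map a site to its clock region or half column (the uniform 32x60 and 2x30 grids in the usage note are a simplification of the real device), and all names are hypothetical.

```python
from collections import defaultdict


def clock_region_usage(cells, region_of):
    """Clocks occupying each clock region.

    cells: iterable of (x, y, clock). region_of(x, y) -> (col, row).
    A clock occupies every region overlapped by the bounding box of its cells.
    """
    bbox = {}
    for x, y, clk in cells:
        cx, cy = region_of(x, y)
        x0, y0, x1, y1 = bbox.get(clk, (cx, cy, cx, cy))
        bbox[clk] = (min(x0, cx), min(y0, cy), max(x1, cx), max(y1, cy))
    usage = defaultdict(set)
    for clk, (x0, y0, x1, y1) in bbox.items():
        for cx in range(x0, x1 + 1):
            for cy in range(y0, y1 + 1):
                usage[(cx, cy)].add(clk)
    return usage


def half_column_usage(cells, half_column_of):
    """Clocks with at least one cell in each half column (no bounding box)."""
    usage = defaultdict(set)
    for x, y, clk in cells:
        usage[half_column_of(x, y)].add(clk)
    return usage


def violations(usage, limit):
    """Regions (or half columns) that hold more than `limit` clocks."""
    return sorted(k for k, clks in usage.items() if len(clks) > limit)
```

With `region_of = lambda x, y: (x // 32, y // 60)` and `half_column_of = lambda x, y: (x // 2, y // 30)`, a placement is clock-legal when `violations(clock_region_usage(cells, region_of), 24)` and `violations(half_column_usage(cells, half_column_of), 12)` are both empty.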

72 Clock Region
- Clock region: ~32x60 sites => global, <= 24 clocks in each
- Solution: plan clock regions, then apply the plan to GP, LG, DP

73 Clock Region Planning
- Clock bounding box (CBB): restrict the movement of cells of the same clock to a bounding box
- Shrinking: reduce overflow in clock regions iteratively until none remains
- Expanding: reduce cell density in CBBs iteratively until impossible

74-78 Clock Region Planning (example)
Assume 3x3 clock regions, <= 2 clocks in each clock region, and 4 clocks, each with its own CBB. A region covered by all four CBBs overflows: #clk = 4 > 2.

79-82 Clock Region Planning: Shrinking
Reduce overflow in clock regions iteratively until none remains:
- For the clock region with max overflow, calculate the total cell displacement when shrinking
- Select the CBB and direction with min displacement and shrink
It's legal now!

83-88 Clock Region Planning: Expanding
Reduce cell density in CBBs iteratively until impossible:
- For the unmarked CBB with max cell density, try expanding; mark it if it cannot expand
It's exhausted now!

89 Clock Region
- Plan clock regions
- Apply the plan to GP, LG, DP
  - GP: add box constraints (not implemented)
  - LG/DP: only consider sites within the CBB

90 Half Column
- Half column: 2x30 sites => local, <= 12 clocks in each
- Solution: resolve overflow after normal LG; forbid movement causing overflow in DP

91-94 Half Column: Resolve overflow after normal LG
- For a half column with overflow, select the clock with the fewest cells
- Move its cells to neighboring overflow-free half columns with min displacement
It's legal now!
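
The half-column overflow resolution can be sketched as a small greedy loop. This is a simplified illustration rather than RippleFPGA's code: half columns are assumed to lie in a single row indexed by integers, displacement is the index difference, all cells of the chosen clock move to one target, and site capacity is ignored. The names are hypothetical.

```python
from collections import defaultdict


def resolve_half_column_overflow(cells, num_half_columns, limit=12):
    """Move clocks out of overflowing half columns.

    cells: {cell_id: (half_column_index, clock)}.
    Returns a new dict with the same shape.
    """
    placed = dict(cells)
    while True:
        by_col = defaultdict(lambda: defaultdict(list))
        for cid, (col, clk) in placed.items():
            by_col[col][clk].append(cid)
        over = [c for c in by_col if len(by_col[c]) > limit]
        if not over:
            return placed
        col = over[0]
        # Evict the clock with the fewest cells in this half column.
        clk = min(by_col[col], key=lambda k: (len(by_col[col][k]), k))
        # A target must stay within the limit after receiving the clock.
        targets = [
            t for t in range(num_half_columns)
            if t != col and (
                len(by_col[t]) < limit
                or (clk in by_col[t] and len(by_col[t]) <= limit)
            )
        ]
        if not targets:
            raise RuntimeError("no overflow-free half column available")
        target = min(targets, key=lambda t: (abs(t - col), t))
        for cid in by_col[col][clk]:
            placed[cid] = (target, clk)
```

Each pass removes one clock from an overflowing half column without creating a new overflow, so the loop terminates.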

95 Summary Background Our Flow How We Handle Clock Rules Clock region Plan clock region Apply it to GP, LG, DP Half column Resolve overflow after normal LG Forbid movement causing overflow in DP

97 UT DA UTPlaceF 2.0 ISPD 2017 Clock-Aware FPGA Placement Contest Wuxi Li, David Z. Pan ECE Department, University of Texas at Austin 97

98 Team Introduction Wuxi Li Ph.D. student UT-Austin David Z. Pan Professor UT-Austin UT Design Automation Lab 98

99 Outline Original UTPlaceF Flow Clock Constraints Clock Region Constraint Half Column Constraint Clock Region Assignment UTPlaceF 2.0 Flow 99

100 Original UTPlaceF Flow
Circuit netlist -> Flat Initial Placement (FIP) -> Packing -> Global Placement -> Legalization -> Detailed Placement -> Done
FIP details:
- Wirelength-driven phase: Quadratic Programming + Rough Legalization, repeated until almost converged; then legalize DSP, RAM, I/O
- Routability-driven phase: Cell Inflation, then Quadratic Programming + Rough Legalization, repeated until converged (FIP done)

101 Clock Region Constraint The FPGA is divided into 5 by 8 clock regions Clock demand of each clock region

102 Half Column Constraint Each clock region is divided into half column regions Clock demand of each half column region

103 Clock Region Assignment Problem
- Inputs: a rough legalized placement
- Outputs: cell-to-clock-region assignment with minimized total cell movement
- Constraints: the capacity constraint is satisfied for each clock region; clock demand <= 24 for each clock region

104 Problem Transformation 104

105 Algorithm Overview 105

106 Min-Cost-Max-Flow Based Assignment 106

107 UTPlaceF 2.0 Flow
Circuit netlist -> Flat Initial Placement (FIP) -> Clock-Aware Packing -> Clock Region Assign. + Global Placement -> Clock Region Assign. + Half Column Assign. + Legalization -> Clock-Aware Detailed Placement -> Done
FIP details:
- Wirelength-driven phase: Quadratic Programming + Rough Legalization, repeated until almost converged; then legalize DSP, RAM, I/O
- Routability & clock driven phase: Cell Inflation, then Quadratic Programming + Clock Region Assign. + Rough Legalization, repeated until converged (FIP done)

108 Thanks! 108

110 VDAplacer ISPD 2017 Contest Clock-Aware FPGA Placement Presenter: Chen Chen Advisor: Prof. Hung-Ming Chen Dept. of Electronic Engineering, National Chiao Tung University 2017/3/22 Department of Electronics Engineering, National Chiao Tung University VLSI Design Automation LAB 110

111 Outline
- Problem Formulation
  - FPGA Packing Problem
  - Clock-Aware Heterogeneous Placement
- Proposed Algorithm
  - Dynamic Packing with physical information
  - Global Placement
  - Placement Migration
  - Legalization and Detailed Placement

113 FPGA Packing Problem
The FPGA packing problem is to cluster LUTs and FFs into groups to minimize the total number of blocks and block interconnections while satisfying the limitations of the FF controlling signals and the fracturable LUT constraints.
A configurable logic block (CLB):
- contains 8 fracturable LUTs, 16 FFs, 2 clock inputs (CLK), 2 set/reset inputs (SR), and 4 clock enables (CE)
- the CEs are independent for { FF0, FF2, FF4, FF6 }, { FF1, FF3, FF5, FF7 }, { FF8, FF10, FF12, FF14 }, { FF9, FF11, FF13, FF15 }
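
The FF control-signal limits of a CLB can be expressed as a small legality check. This sketch assumes that FF0 to FF7 share one CLK and one SR and FF8 to FF15 share the other pair, which matches the upper/lower-half description in the RippleFPGA slides but is not spelled out on this slide; the function name and slot encoding are hypothetical.

```python
# FF slots that must share one clock-enable signal.
CE_GROUPS = [(0, 2, 4, 6), (1, 3, 5, 7), (8, 10, 12, 14), (9, 11, 13, 15)]
# Assumption: each half of the CLB has one CLK input and one SR input.
HALVES = [range(0, 8), range(8, 16)]


def clb_ffs_legal(slots):
    """Check the FF control-set constraints of one CLB.

    slots: list of 16 entries, each None (empty) or a (clk, sr, ce) tuple.
    """
    if len(slots) != 16:
        raise ValueError("a CLB has exactly 16 FF slots")
    for half in HALVES:
        used = [slots[i] for i in half if slots[i] is not None]
        if len({ff[0] for ff in used}) > 1:  # more than one clock in a half
            return False
        if len({ff[1] for ff in used}) > 1:  # more than one set/reset in a half
            return False
    for group in CE_GROUPS:
        ces = {slots[i][2] for i in group if slots[i] is not None}
        if len(ces) > 1:                     # CE conflict inside a CE group
            return False
    return True
```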

114 FPGA Packing Problem
A fracturable LUT has three modes of operation:
- Mode (1): as a single K-input LUT (K from 1 to 6)
- Mode (2): as two 5-input (or fewer input) LUTs with separate outputs but common inputs
- Mode (3): as two 3-input (or fewer input) LUTs irrespective of common inputs
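
A packer needs to decide whether two LUTs may share one fracturable LUT. The sketch below encodes the modes above under one explicit assumption: "common inputs" in mode (2) is read as the two LUTs together using at most 5 distinct input nets. The function name is hypothetical, and returning mode 3 first for two small LUTs is an arbitrary choice.

```python
def lut_pair_mode(inputs_a, inputs_b):
    """Return the mode (2 or 3) in which two LUTs can share a fracturable
    LUT, or None if they cannot be paired.

    inputs_a, inputs_b: iterables of input net names.
    """
    a, b = set(inputs_a), set(inputs_b)
    if not (1 <= len(a) <= 6 and 1 <= len(b) <= 6):
        raise ValueError("each LUT must have 1 to 6 inputs")
    if len(a) <= 3 and len(b) <= 3:
        return 3  # two small LUTs, shared inputs not required
    if len(a) <= 5 and len(b) <= 5 and len(a | b) <= 5:
        return 2  # assumption: at most 5 distinct inputs in total
    return None
```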

115 Clock-Aware Heterogeneous Placement
The FPGA placement problem: given a heterogeneous FPGA and a circuit, determine the desired position for each movable block to minimize the routed wirelength, such that each block is in its specified regions and no blocks overlap.

116 Clock-Aware Heterogeneous Placement
Clock-aware placement constraints:
- The number of global clocks in each clock region is at most 24.
- Within each clock region, each half column has at most 12 clocks.
- Each clock should be constrained to a continuous rectangular area.
[Figure: 5x8 clock regions; (14~18)x2 half columns.]

118 Dynamic Packing with physical information
- Apply the POLAR [1] framework
- Increase the force of the anchor net in the initial placement stage and decrease it in the dynamic packing stage
- Packing factors: # of clocks, # of control sets (C/R/CE), distance, # of common nets
Initial Placement (repeat until the upper bound and lower bound converge):
1. Solve the quadratic objective function using the B2B model and obtain the lower-bound HPWL placement using CG
2. Obtain the upper-bound HPWL placement using Look Ahead Legalization (LAL)
3. Density-Aware Global Move
4. If not converged, legalized locations serve as pseudo anchors and the anchors are added to the quadratic objective function
Dynamic Packing (x5, repeat until there is no more good packing):
1. Steps 1 to 3 as in Initial Placement
2. Legalized locations serve as pseudo anchors and the anchors are added to the quadratic objective function
3. Packing
Then continue to Global Placement.
[1]: T. Lin, C. Chu, J. R. Shinnerl, I. Bustany, and I. Nedelchev. POLAR: Placement based on novel rough legalization and refinement. ICCAD '13.

119 Global Placement
- Lower density around fixed nodes
- HPWL-driven global placement: B2B wirelength model; lower-bound placement from solving the quadratic objective function; upper-bound placement from look-ahead legalization
- Density-Aware Global Move: move to the optimal region with consideration of density and wirelength; move to a clock-valid location (after clock selection)
- Clock Selection:
  1. Select an initial clock region for each clock
  2. Expand each clock's area gradually in consideration of the number of uncovered nodes
  3. Unpack CLBs that cannot find any valid location
Loop: solve the quadratic objective function using the B2B model (lower-bound HPWL placement using CG) -> Look Ahead Legalization (upper-bound HPWL placement) -> Density-Aware Global Move -> upper bound and lower bound converge? NO: routing congestion estimation, congestion-driven packing (near convergence), legalized locations serve as pseudo anchors added to the quadratic objective function; YES: Placement Migration.

120 Global Placement
- Routing Congestion Estimation: apply NCTUgr for estimation
- Congestion-driven Packing: apply further packing for overlapped but routing-congestion-free areas; apply unpacking for routing-congested areas
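
The B2B (bound-to-bound) net model used in the quadratic step connects the two boundary pins of a net to each other and to every inner pin, with weight $w_{ij} = \frac{2}{(P-1)\,|x_i - x_j|}$ for a net with $P$ pins, so that the quadratic cost $\frac{1}{2}\sum w_{ij}(x_i - x_j)^2$ equals the HPWL at the positions where the weights were computed. The sketch below shows this for one net along one axis; the weight formula follows the Kraftwerk2 paper cited in this deck, while the function names and the `eps` guard against zero distances are assumptions.

```python
def b2b_weights(xs, eps=1e-9):
    """Bound-to-bound connection weights for one net along one axis.

    xs: pin coordinates. Returns {(i, j): weight} with i < j.
    """
    p = len(xs)
    if p < 2:
        return {}
    lo = min(range(p), key=lambda i: xs[i])
    hi = max(range(p), key=lambda i: xs[i])
    if lo == hi:  # all pins coincide; pick any second pin as the other bound
        hi = 1 if lo == 0 else 0
    pairs = [(lo, hi)]
    for k in range(p):
        if k not in (lo, hi):
            pairs += [(lo, k), (k, hi)]
    weights = {}
    for i, j in pairs:
        dist = max(abs(xs[i] - xs[j]), eps)
        weights[(min(i, j), max(i, j))] = 2.0 / ((p - 1) * dist)
    return weights


def quadratic_cost(xs, weights):
    """0.5 * sum of w_ij * (x_i - x_j)^2 over all weighted pin pairs."""
    return 0.5 * sum(w * (xs[i] - xs[j]) ** 2 for (i, j), w in weights.items())
```

For pins at 0, 1, 4, and 2.5 the quadratic cost with B2B weights is 4, the same as the one-dimensional HPWL.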

121 Placement Migration
For closing the gap between global placement and legalization, modify the three-force balance system from Kraftwerk2 [2]:
- Hold force: preserve the integrity of the original placement result
- Net force: model the wirelength of the netlist
- Move force: perturb the placement and smooth the transition from global placement to legalization
Flow: obtain the move force by calculating the cell density gradient (the cell's surface model is obtained by Gaussian blurring) -> obtain the target step size for each cell -> density overflow? YES: repeat; NO: Legalization & Detailed Placement.
[2]: P. Spindler, U. Schlichtmann, and F. M. Johannes. Kraftwerk2: A fast force-directed quadratic placement approach using an accurate net model. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 27(8), Aug. 2008.

122 Legalization and Detailed Placement (1/2)
Legalization minimizes displacement:
1. Apply bipartite matching to each clock region for legalization.
2. Select clocks for every half column.
3. Apply another bipartite matching to fit the half-column constraints.
Flow: legalization using bipartite matching; wirelength-driven detailed placement; placement result.
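Step 1 can be sketched as a minimum-cost bipartite matching between cells and legal sites, with Manhattan displacement as the cost. The sketch below uses a textbook Hungarian algorithm and covers a single region only; the half-column clock step is not modeled, and the names `min_cost_assignment` and `legalize` are hypothetical rather than taken from VDAplacer.

```python
def min_cost_assignment(cost):
    """Hungarian algorithm for n rows and m >= n columns.
    Returns assign, where assign[i] is the column matched to row i."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: row matched to column j (1-based), 0 = free
    way = [0] * (m + 1)   # predecessor column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = p[j0]
            delta = inf
            j1 = 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assign = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assign[p[j] - 1] = j - 1
    return assign


def legalize(cells, sites):
    """Match each cell (x, y) to a distinct site with minimum total
    Manhattan displacement. Requires len(sites) >= len(cells)."""
    cost = [[abs(cx - sx) + abs(cy - sy) for sx, sy in sites]
            for cx, cy in cells]
    return min_cost_assignment(cost)


cells = [(2.1, 0.0), (1.9, 0.0)]
sites = [(0.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
print(legalize(cells, sites))
```

In this example a greedy pass in cell order would give the first cell the site at x = 2 and push the second cell to x = 3; the matching instead sends the first cell to x = 3 and the second to x = 2, which lowers the total displacement.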

123 Legalization and Detailed Placement (2/2)
Detailed placement: perform the Global Swap [3] to reduce the wirelength. Identify a good swap pair or a free space for each cell. After swapping, the cell is in the position that gives the best wirelength while all other cells are treated as fixed.
Flow: legalization using bipartite matching; wirelength-driven detailed placement; placement result.
[3] M. Pan, N. Viswanathan, and C. Chu. An efficient and effective detailed placement algorithm. In IEEE/ACM International Conference on Computer-Aided Design, pages 48-55, Nov. 2005.
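The "best wirelength position with all other cells fixed" can be sketched with the median-based optimal region commonly used in global swap: for every net on the cell, take the bounding box of the other pins, then take the median interval of the collected bounds on each axis. The data layout (`nets` as lists of cell names, `pos` as a name-to-coordinate map) and the function names are assumptions; legality and the actual swap-partner search are left out.

```python
def hpwl(nets, pos):
    """Total half-perimeter wirelength over all nets."""
    total = 0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def optimal_region(cell, nets, pos):
    """Return ((x_lo, x_hi), (y_lo, y_hi)), the region where `cell` gives
    the smallest HPWL when every other cell stays fixed, or None if the
    cell shares no net with another cell."""
    xs, ys = [], []
    for net in nets:
        if cell not in net:
            continue
        others = [pos[c] for c in net if c != cell]
        if not others:
            continue
        # bounding box of the net without the cell itself
        xs += [min(p[0] for p in others), max(p[0] for p in others)]
        ys += [min(p[1] for p in others), max(p[1] for p in others)]
    if not xs:
        return None
    xs.sort()
    ys.sort()
    mid = len(xs) // 2  # the lists have even length: two bounds per net
    return (xs[mid - 1], xs[mid]), (ys[mid - 1], ys[mid])


pos = {"a": (0, 0), "b": (4, 0), "c": (6, 2), "d": (10, 5)}
nets = [["a", "b"], ["a", "c"], ["a", "d"]]
region = optimal_region("a", nets, pos)
before = hpwl(nets, pos)
pos["a"] = (region[0][0], region[1][0])  # move to a corner of the region
print(region, before, hpwl(nets, pos))
```

A detailed placer would then look inside this region for a cell to swap with or an empty site, and accept the move only if the wirelength gain is positive.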

124 Thank you!

125 Benchmarking Results

126 Top-5 Results: Place/Route Completion
Designs     Placer-A  Placer-B  Placer-C  Placer-D  Placer-E
CLK-FPGA01  PASS      PASS      PASS      PASS      FAIL
CLK-FPGA02  PASS      PASS      PASS      PASS      PASS
CLK-FPGA03  PASS      PASS      PASS      PASS      FAIL
CLK-FPGA04  PASS      PASS      PASS      PASS      FAIL
CLK-FPGA05  PASS      PASS      PASS      PASS      FAIL
CLK-FPGA06  PASS      PASS      PASS      PASS      FAIL
CLK-FPGA07  PASS      PASS      PASS      PASS      PASS
CLK-FPGA08  PASS      PASS      PASS      PASS      PASS
CLK-FPGA09  PASS      PASS      PASS      PASS      PASS
CLK-FPGA10  PASS      PASS      PASS      PASS      FAIL
CLK-FPGA11  PASS      PASS      PASS      PASS      FAIL
CLK-FPGA12  PASS      PASS      PASS      PASS      PASS
CLK-FPGA13  PASS      PASS      PASS      PASS      PASS

127 Top-4 Placers: Total Routed Wirelength
[Table: one row per design, CLK-FPGA01 to CLK-FPGA13; columns Placer-A, Placer-B, Placer-C, Placer-D. Numeric values are missing from this transcript.]

128 Total Routed Wirelength (Normalized)
[Table: one row per design, CLK-FPGA01 to CLK-FPGA13, plus an Average row; columns Placer-A, Placer-B, Placer-C, Placer-D. Numeric values are missing from this transcript.]

129 Placer Runtime (seconds)
[Table: one row per design, CLK-FPGA01 to CLK-FPGA13; columns Fastest, 2nd, 3rd, 4th. Numeric values are missing from this transcript.]
Less than 10 minutes for the largest design!

130 Placer Runtime (Normalized)
[Table: one row per design, CLK-FPGA01 to CLK-FPGA13, plus an Average row; columns Fastest, 2nd-fastest, 3rd-fastest, 4th-fastest. Numeric values are missing from this transcript.]

131 Final Results with Runtime Factor
[Table: one row per design, CLK-FPGA01 to CLK-FPGA13, plus an Average row; columns Placer-A, Placer-B, Placer-C. Numeric values are missing from this transcript.]

132 Award Ceremony

133 Fifth Place goes to

134 GPlace 2.0 (5th place): Clock-Aware Placement Tool for UltraScale FPGAs. Ziad Abuowaimer, Shawki Areibi, Anthony Vannelli, Gary Grewal. University of Guelph. March 22, 2017

135 Fourth Place goes to

136 VDAplacer (4th place). ISPD 2017 Contest Clock-Aware FPGA Placement. Presenter: Chen Chen. Advisor: Prof. Hung-Ming Chen. Dept. of Electronic Engineering, National Chiao Tung University

137 Third Place goes to

138 CUHK RippleFPGA (3rd place, fastest placer). Gengjie Chen, Chak-Wa Pui, Evangeline F. Y. Young, Bei Yu. March 22, 2017

139 Second Place goes to

140 NTUfplace (2nd place): Clock-Aware FPGA Placement. Yun-Chih Kuo, Chau-Chin Huang, Shih-Chun Chen, Chun-Han Chiang, Yao-Wen Chang, and Sy-Yen Kuo. National Taiwan University. Mar. 22, 2017

141 First Place goes to

142 UTPlaceF 2.0 (1st place; UT DA, two years in a row!). ISPD 2017 Clock-Aware FPGA Placement Contest. Wuxi Li, David Z. Pan. ECE Department, University of Texas at Austin

143 Final Results with Runtime Factor
[Table: one row per design, CLK-FPGA01 to CLK-FPGA13, plus an Average row; columns UTPlaceF2.0, NTUfplace, RippleFPGA. Numeric values are missing from this transcript.]

144 Congratulations!
