
EN2911X: Reconfigurable Computing
Topic 01: Programmable Logic
Prof. Sherief Reda, School of Engineering, Brown University
Fall 2014

Contents
1. Architecture of modern FPGAs
   Programmable interconnect
   Programmable logic blocks
2. How to design FPGAs?
3. Case studies

1. FPGA architecture
[Figure: programmable interconnect, programmable logic blocks, and a programmable logic element; from Maxfield 04]
Objective: study the organization of programmable logic blocks and interconnects.

Basic logic element (BLE) [Rose 04] [Maxfield 04]
How many configuration bits does a K-input lookup table (LUT) hold?
How many Boolean functions can a K-input LUT implement?
What is the best LUT size?
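The first two questions have closed-form answers, which the sketch below computes. A K-input LUT stores one bit per input combination, and every distinct bit pattern is a distinct function. The function names and the convention that inputs[0] is the least significant address bit are assumptions made for this illustration, not part of the slides.

```python
def lut_config_bits(k):
    """Number of SRAM bits in a K-input LUT: one bit per input combination."""
    return 2 ** k


def lut_function_count(k):
    """Number of distinct Boolean functions of K inputs: one per truth table."""
    return 2 ** lut_config_bits(k)


def lut_eval(mask, inputs):
    """Read the LUT: the inputs form an address (inputs[0] is the LSB, an assumed convention)."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return (mask >> address) & 1


for k in (2, 4, 6):
    print(k, lut_config_bits(k), lut_function_count(k))
```

For example, a 2-input LUT programmed with the mask 0b1000 behaves as an AND gate: only address 3 (both inputs high) reads a 1.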

A closer look at the BLE

Larger circuits need to be decomposed into a number of BLEs. [Figure from Cong FPGA 01]

Example [from J. Zambreno]
$F = A_0 A_1 A_3 + A_1 A_2 \bar{A}_3 + \bar{A}_0 \bar{A}_1 \bar{A}_2$
[Figure: implementations of F using 4-input, 3-input, and 2-input LUTs]
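As a rough illustration of the 4-input case, the sketch below enumerates the truth table of F and packs it into the 16 configuration bits of a single 4-input LUT. The bit ordering (A0 as the least significant address bit) and the helper names are assumptions of this sketch.

```python
def f(a0, a1, a2, a3):
    """The example function, written directly from its sum-of-products form."""
    return (a0 & a1 & a3) | (a1 & a2 & (1 - a3)) | ((1 - a0) & (1 - a1) & (1 - a2))


def lut_mask(func, k):
    """Pack the truth table of func into an integer, one bit per input combination."""
    mask = 0
    for address in range(2 ** k):
        bits = [(address >> i) & 1 for i in range(k)]  # bits[0] is A0
        mask |= func(*bits) << address
    return mask


mask = lut_mask(f, 4)
print(hex(mask))  # 0x89c1
```

The mask has six bits set, one per minterm of F. With smaller LUTs the same function needs several LUTs in series, which is the decomposition the figure illustrates.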

Logic block clusters (logic array block, LAB; configurable logic block, CLB)
Assume a K-input LUT in each BLE and N BLEs per logic cluster.
The BLEs in each logic cluster are fully connected or nearly fully connected.
Why is I, the number of cluster inputs, less than K × N? [Betz-Rose 97]

Heterogeneous reconfigurable logic
The reconfigurable fabric might contain non-reconfigurable elements that interface to the logic blocks through the programmable interconnect fabric.
Examples: embedded memory; embedded multipliers, adders, and MACs; embedded processors.

Embedded memory blocks
Memory is costly to implement with configurable logic blocks, so FPGAs add hard blocks of RAM.
The position and size of the blocks vary with the FPGA device. The size ranges from a few thousand to tens of thousands of bits per RAM block. [Maxfield 04]
Each block can be used independently or combined with others to form larger RAM blocks.
Blocks can be single-port or dual-port RAMs.

Embedded multipliers and adders
Multipliers are inherently slow if implemented by connecting a large number of programmable logic blocks, so FPGAs add hard-wired multiplier blocks.
These are typically located close to the embedded RAM blocks.
Some FPGAs use multiply-and-accumulate (MAC) blocks, which are useful in DSP applications.

Programmable routing
Wires provide the communication fabric that routes the output of one computational node to the inputs of another.
Why is routing so crucial?
Routing resources occupy a larger area than logic resources in an FPGA.
The delay of an unbuffered wire grows quadratically with its length.
Technology scaling reduces device delay but increases wire delay.
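The quadratic growth follows from the Elmore delay of a distributed RC line, where both resistance and capacitance scale with length. The sketch below uses that model. The per-millimeter resistance and capacitance are hypothetical placeholder values chosen only to make the numbers readable, not figures from the slides.

```python
def wire_delay(length_mm, r_per_mm=100.0, c_per_mm=0.2e-12):
    """Elmore delay in seconds of an unbuffered distributed RC wire: 0.5 * r * c * L^2.

    r_per_mm (ohms/mm) and c_per_mm (farads/mm) are hypothetical values.
    """
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2


for length in (1, 2, 4):
    print(length, wire_delay(length))
```

Doubling the length quadruples the delay, which is one reason long connections are a first-order concern in FPGA routing architectures.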

General routing definitions
[Figure: CLBs separated by routing channels, with a track, a channel, and a segment labeled]
A wire segment is a wire unbroken by programmable switches.
A track is a sequence of one or more wire segments in a line; the segments can be connected by switches at their ends.
A routing channel is a group of parallel tracks. The channel width is the number of tracks in the channel.

Connection blocks: formed where CLB input or output pins connect to the routing channels.
Life would be easy if only logic blocks within the same column or row needed to communicate!

Segment-to-segment switch design for bidirectional wires [Lemieux 04]

Switch blocks: formed wherever horizontal and vertical channels intersect.
The size of a switch box grows quadratically with the number of its input wires.
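To make the quadratic growth concrete, the sketch below counts the switches in a box that lets every incoming wire reach every other wire, and compares it with a sparser "disjoint" pattern in which track i on one side connects only to track i on the other three sides. The disjoint pattern is brought in here only for comparison; the slides do not say which pattern they depict, and the function names are made up for this example.

```python
def full_crossbar_switches(n_wires):
    """Switches needed to connect every wire end to every other one: n choose 2."""
    return n_wires * (n_wires - 1) // 2


def disjoint_switch_block_switches(w):
    """Disjoint pattern (an assumed comparison point): per track, one switch
    for each pair of the 4 sides, i.e. 6 switches per track."""
    sides = 4
    return w * (sides * (sides - 1) // 2)


for w in (4, 8, 16):
    print(w, full_crossbar_switches(4 * w), disjoint_switch_block_switches(w))
```

The fully connected count grows with the square of the channel width, while the restricted pattern grows linearly, at the price of less routing flexibility.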

Bidirectional switch details [Lemieux 04, Tessier]

Segmented and hierarchical routing
Segmented routing: short wires accommodate local traffic. Short wires can be connected together through switch boxes to emulate longer wires. The fabric also contains long wires that allow efficient communication without passing through switches.
Hierarchical routing: routing within a group of logic blocks occurs at the local level. Longer hierarchical wires connect different groups.
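A simple way to see the benefit of longer segments is to count the switches a signal crosses on a straight path. The sketch below assumes one switch between each pair of consecutive segments and ignores connection blocks. The function name and this counting model are simplifications made for illustration.

```python
import math


def switches_on_path(distance_blocks, segment_length):
    """Switches crossed when covering distance_blocks (>= 1) with equal-length segments.

    Assumes one programmable switch between consecutive segments.
    """
    segments = math.ceil(distance_blocks / segment_length)
    return segments - 1


for seg in (1, 4, 16):
    print(seg, switches_on_path(16, seg))
```

Spanning 16 blocks with unit-length segments crosses 15 switches, while a single length-16 wire crosses none. Each switch adds delay, which is why architectures mix short and long wires.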

2. How to design an FPGA?
Key design parameters:
K: LUT size (design of the BLE)
N: number of BLEs per LAB
I: number of external inputs to the LAB
W: number of wire tracks in a channel
Design goals: area, performance, area-delay product, power

Methodology [from Ahmed & Rose 04]
An architectural design flow is needed that uses CAD tools together with benchmark circuits.

Determine W (number of tracks per channel)
To determine W for a given K, N, and I: repeatedly route each circuit, removing tracks from the architecture until routing fails.
Then add 30% more tracks to the minimum track count, perform a final low-stress routing, and use it to measure the critical path delay.
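The procedure above can be sketched as a short loop. Here routes_ok is a hypothetical callback standing in for a real place-and-route run, toy_router is a stub that pretends a circuit needs at least 37 tracks, and rounding the 30% margin up to a whole track is an assumption of this sketch.

```python
from typing import Callable


def min_channel_width(routes_ok: Callable[[int], bool], w_start: int) -> int:
    """Remove tracks one at a time until routing fails; return the last width that routed."""
    if not routes_ok(w_start):
        raise ValueError("starting channel width does not route")
    w = w_start
    while w > 1 and routes_ok(w - 1):
        w -= 1
    return w


def low_stress_width(w_min: int) -> int:
    """Minimum track count plus 30%, rounded up (integer arithmetic avoids float error)."""
    return (13 * w_min + 9) // 10


def toy_router(w: int) -> bool:
    """Hypothetical stand-in for a router: succeeds when at least 37 tracks are available."""
    return w >= 37


w_min = min_channel_width(toy_router, w_start=100)
w_final = low_stress_width(w_min)
print(w_min, w_final)  # 37 49
```

A real flow would typically use a binary search rather than removing one track at a time, since each routing attempt is expensive. The linear loop is kept here because it mirrors the wording of the slide.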

How to determine I given K and N? [from Ahmed & Rose 04]
Experiments aimed for 98% LUT utilization.
I should be a function of K (the LUT size) and N (the number of LUTs in a cluster).
A larger I implies larger and slower multiplexers.
I is less than K × N; why?
Empirically, $I = K(N+1)/2$.
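The empirical rule can be compared against the upper bound K × N with a few lines of code. Rounding up when K(N+1) is odd is an assumption of this sketch; the slide gives only the formula.

```python
def cluster_inputs(k, n):
    """Empirical number of cluster inputs, I = K(N+1)/2, rounded up (assumed rounding)."""
    return (k * (n + 1) + 1) // 2


for n in (1, 4, 10):
    print(n, cluster_inputs(4, n), 4 * n)
```

For K = 4 and N = 10 the rule gives 22 inputs instead of 40. Fewer inputs suffice because LUTs in the same cluster often share input signals, and many LUT inputs are driven by outputs of BLEs inside the cluster rather than from outside.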

FPGA area as a function of K and N? [from Ahmed & Rose 04]
LUT sizes of 4 and 5 are the most area efficient for all cluster sizes.
Total area decreases as the cluster size increases from 1 to 3, for all LUT sizes. Beyond that, making clusters larger has very little impact on total FPGA area.
Why? Increasing K and/or N increases intra-cluster area but reduces inter-cluster area.

Impact of LUT size (K) and LUTs per cluster (N) on intra-cluster area? [from Ahmed & Rose 04]
Two reasons intra-cluster area increases with K:
1. The logic block area grows exponentially with LUT size, as there are $2^K$ bits in a K-input LUT.
2. Larger LUT sizes require larger intra-cluster multiplexers, because the number of inputs to each multiplexer is $I + N = K(N+1)/2 + N$.
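Both effects can be tabulated directly. The sketch below assumes a fully connected cluster with one multiplexer per LUT input (K × N multiplexers in total) and rounds I up when K(N+1) is odd. Both are assumptions of this illustration rather than figures from the slides.

```python
def intra_cluster_summary(k, n):
    """Rough intra-cluster resource counts for N BLEs with K-input LUTs."""
    i = (k * (n + 1) + 1) // 2           # cluster inputs, I = K(N+1)/2 rounded up
    return {
        "lut_bits": n * 2 ** k,          # 2^K configuration bits per LUT
        "mux_inputs": i + n,             # each mux selects among I inputs and N feedbacks
        "mux_count": k * n,              # assumed: one mux per LUT input pin
    }


for k in (4, 6):
    print(k, intra_cluster_summary(k, 10))
```

Going from K = 4 to K = 6 at N = 10 quadruples the LUT bits (160 to 640) and widens each multiplexer from 32 to 43 inputs, which matches the two reasons listed above.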

Impact of LUT size (K) and LUTs per cluster (N) on inter-cluster area? [from Ahmed & Rose 04]
The inter-cluster routing area decreases in a linear fashion with increasing LUT size.

Final impact on area [from Ahmed & Rose 04]
[Figure: intra-cluster area plus inter-cluster area gives total area]
Early increases in K (from 2 to 5) lead to modest increases in intra-cluster area with steady reductions in inter-cluster area.
Subsequent increases in K lead to large intra-cluster area increases that offset the inter-cluster area reductions.

Impact of K and N on performance [from Ahmed & Rose 04]
As the LUT size and cluster size increase:
the delay of the BLE and the delay through a single cluster increase;
the number of BLEs and clusters on the critical path decreases.
[Figure: the two trends combined]
K = 6 seems the best for performance.

Adaptive LUTs
Motivation: a fixed K targets the average behavior of circuits. Some circuits benefit from a higher K and others from a lower K.
A higher K (e.g., K = 6) improves performance, but a lower K (e.g., K = 4) is better for area.
Can we have adaptive LUTs, i.e., LUTs that are sometimes used as 6-input LUTs and sometimes as multiple 4-input LUTs?
Two approaches: composable LUTs and fracturable LUTs.

Composable LUTs
A composable 6-LUT is constructed from 4-LUTs.
All pins are independent. How many more input pins are required than for a standard 6-input LUT?
What is the cost of a pin?

Fracturable LUTs
Example of a (6, 2) fracturable LUT.
Sharing input pins reduces the area overhead but also reduces logic flexibility.
A (k = 6, m = 2) LUT can implement:
any 6-input function;
two 5-input functions (which must share two inputs);
any two 4-input functions.
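The reason a 6-input LUT can be fractured into two 5-input LUTs is Shannon decomposition: the 64-bit truth table is simply two 32-bit tables selected by the sixth input. The sketch below demonstrates this on truth tables only. It does not model pin sharing, and the helper names and bit ordering (inputs[0] is the least significant address bit) are assumptions.

```python
def split_lut(mask, k):
    """Shannon decomposition on the top input: return (low, high) (k-1)-input tables."""
    half = 1 << (k - 1)
    low = mask & ((1 << half) - 1)   # rows where input k-1 is 0
    high = mask >> half              # rows where input k-1 is 1
    return low, high


def eval_lut(mask, inputs):
    """Read a LUT; inputs[0] is the least significant address bit."""
    address = sum(bit << i for i, bit in enumerate(inputs))
    return (mask >> address) & 1


def eval_as_two_halves(mask, inputs):
    """Evaluate a k-input LUT as two (k-1)-input LUTs plus a 2:1 multiplexer."""
    low, high = split_lut(mask, len(inputs))
    return eval_lut(high if inputs[-1] else low, inputs[:-1])


six_lut = 0x0123456789ABCDEF  # arbitrary 64-bit truth table
agree = all(
    eval_lut(six_lut, bits) == eval_as_two_halves(six_lut, bits)
    for bits in ([(a >> i) & 1 for i in range(6)] for a in range(64))
)
print(agree)
```

In fractured mode the two halves are used as separate 5-input LUTs. With only k + m = 8 input pins available, the two functions must share two of their inputs, as the slide notes.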

Programming the FPGA
[Figure: configuration data in, configuration data out; legend: I/O pin/pad, SRAM cell]
The configuration memory determines the programmability of the logic blocks and interconnects.
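One common way to load such configuration memory is to link the cells into a long shift register between the "configuration data in" and "configuration data out" pins. The sketch below models that idea and assumes the figure shows a serial chain of this kind. The function name and the 8-cell chain are purely illustrative.

```python
from collections import deque


def shift_in(chain, bitstream):
    """Shift a bitstream into a chain of configuration cells, one bit per clock.

    Returns the bits that fall out of the far end ("configuration data out").
    """
    shifted_out = []
    for bit in bitstream:
        shifted_out.append(chain.pop())   # last cell drives the data-out pin
        chain.appendleft(bit)             # new bit enters at the data-in pin
    return shifted_out


chain = deque([0] * 8)                    # hypothetical 8-cell device, initially cleared
out = shift_in(chain, [1, 0, 1, 1, 0, 0, 1, 0])
print(list(chain), out)
```

After as many clocks as there are cells, the first bit of the bitstream sits in the cell farthest from the input, so the chain holds the bitstream in reverse order.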

Programmable switch technology
Anti-fuse: the switch is OFF by default; when programmed it is ON.
Advantages: negligible delay; small area overhead.
Disadvantages: not really reconfigurable; one-time programmable.
Flash: the switch is ON by default; when programmed it is OFF.
Advantages: programming is not lost when the device is turned off.
Disadvantages: requires more manufacturing steps.
SRAM: a bit cell stores the programmability of the device.
Advantages: can be reconfigured quickly and as often as required; no special fabrication steps.
Disadvantages: takes more area; loses its configuration when turned off.

3. Case study: Altera's Cyclone II device
A two-dimensional array of Logic Array Blocks (LABs), with 16 Logic Elements (LEs) in each LAB.
Embedded memory blocks (M4K) and multipliers (18×18).
Phase-locked loops (PLLs) generate clock signals over a range of frequencies.
The EP2C35 (on the DE2 board) has 60 columns and 45 rows, for a total of 33,216 LEs, 105 M4K blocks, and 35 embedded multipliers.
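A few derived quantities follow from these figures by simple arithmetic. The sketch below uses only numbers stated in this deck (16 LEs per LAB, a 4-input LUT per LE, 4608 bits per M4K block); the variable names are made up for the example.

```python
LES_PER_LAB = 16
TOTAL_LES = 33216
LUT_INPUTS = 4          # each LE contains a 4-input LUT
M4K_BLOCKS = 105
M4K_BITS = 4608         # bits per M4K block, as stated in the deck

labs = TOTAL_LES // LES_PER_LAB           # number of LABs
lut_bits = TOTAL_LES * 2 ** LUT_INPUTS    # LUT configuration bits only
ram_bits = M4K_BLOCKS * M4K_BITS          # total embedded RAM bits

print(labs, lut_bits, ram_bits)  # 2076 531456 483840
```

The LUT bit count covers only the truth tables. The routing multiplexers and switches need their own configuration bits, which this estimate leaves out.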

Logic element organization (normal mode)
The LE has two operating modes: normal and arithmetic.
Normal mode is suitable for general logic implementation.
Each LE has a 4-input LUT, 6 input connections, 3 output connections, LAB-wide synchronous/asynchronous clear and load signals, and a clock signal.

Logic element organization (arithmetic mode)
Arithmetic mode is suitable for implementing adders, counters, accumulators, and comparators.
The LUT is split into two 3-input LUTs, ideal for implementing a 2-bit full adder and a basic carry chain.
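The idea behind arithmetic mode can be sketched with two 3-input truth tables, one producing the sum bit and one producing the carry, chained from bit to bit. This is a functional illustration under that assumption and not a model of the actual Cyclone II circuit. The masks assume the address is formed as a + 2b + 4·cin.

```python
SUM_LUT = 0x96     # a XOR b XOR cin
CARRY_LUT = 0xE8   # majority(a, b, cin)


def lut3(mask, a, b, cin):
    """Read a 3-input LUT; address = a + 2*b + 4*cin (assumed ordering)."""
    return (mask >> (a | (b << 1) | (cin << 2))) & 1


def ripple_add(x, y, width):
    """Add two width-bit numbers using one (sum, carry) LUT pair per bit."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        result |= lut3(SUM_LUT, a, b, carry) << i
        carry = lut3(CARRY_LUT, a, b, carry)   # passed up the carry chain
    return result, carry


print(ripple_add(5, 9, 4))   # (14, 0)
print(ripple_add(15, 1, 4))  # (0, 1)
```

Each loop iteration corresponds to one LE, and the carry variable plays the role of the dedicated carry chain between neighboring LEs.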

Logic array block organization
Each LAB consists of 16 LEs, LAB control signals, LE carry chains, register chains, and local interconnects.
Local interconnects transfer signals between LEs in the same LAB and are driven by column and row interconnects and by LE outputs within the same LAB.
Neighboring LABs, PLLs, M4K RAM blocks, and multipliers from the left and right can also drive a LAB's local interconnect.
Each LE can drive 48 LEs through fast local and direct link interconnects.

Multi-track interconnects
The multitrack interconnect consists of row interconnects (direct link, R4, R24) and column interconnects (register chain, C4, C16).
R4 interconnects span 4 blocks to the right or left; C4 interconnects span 4 blocks up or down.
R24 and C16 interconnects span 24 and 16 blocks, respectively, and connect to R4/C4 interconnects.
R4 and C4 interconnects can drive each other to extend their range.

C4 interconnections
C4 interconnects drive local and R4 interconnects for up to 4 rows.
C16 column interconnects span 16 LABs and provide long column connections.
C16 column interconnects drive LAB local interconnects indirectly, via C4 and R4 interconnects.

Embedded RAMs and multipliers
M4K RAM blocks: 4608 RAM bits (with or without parity); 250 MHz performance; single-port or dual-port memory; can also be configured as a FIFO.
Embedded multipliers: ideal for DSP applications; 250 MHz performance; configured either as one 18-bit multiplier or as two independent 9-bit multipliers.

IO Element (IOE) structure
The IOE structure allows bidirectional signals.
There are 5 IOEs per row I/O block.
Row I/O blocks drive C4, R4, R24, or direct link interconnects. Column I/O blocks drive C4 and C16 interconnects.

Summary
Architectural makeup of FPGAs: LUTs, routing MUXes, switches, multipliers, memory, etc.
Design and area overhead of programmable LUTs and interconnects.
FPGA design methodology.
Impact of FPGA design parameters on area and delay, and how to choose optimal parameters.
Case study on the Altera Cyclone II FPGA.
