L12: Reconfigurable Logic Architectures

Acknowledgements: Materials in this lecture are courtesy of the following sources and are used with permission.
- Frank Honore
- Prof. Randy Katz (United Microelectronics Corporation Distinguished Professor in Electrical Engineering and Computer Science at the University of California, Berkeley) and Prof. Gaetano Borriello (University of Washington Department of Computer Science & Engineering). From Chapter 2 of R. Katz, G. Borriello. Contemporary Logic Design. 2nd ed. Prentice-Hall/Pearson Education, 2005.

History of Computational Fabrics
- Discrete devices: relays, transistors (1940s-50s)
- Discrete logic gates (1950s-60s)
- Integrated circuits (1960s-70s)
  - e.g. TTL packages: Data Book for 100s of different parts
- Gate Arrays (IBM, 1970s)
  - Transistors are pre-placed on the chip; Place and Route software puts the chip together automatically
  - Only the interconnect is programmed (mask programming)
- Software-Based Schemes (1970s-present)
  - Run instructions on a general-purpose core
- Programmable Logic (1980s-present)
  - A chip that can be reprogrammed after it has been fabricated
  - Examples: PALs, EPROM, EEPROM, PLDs, FPGAs
  - Excellent support for mapping from Verilog
- ASIC Design (1980s-present)
  - Turn Verilog directly into layout using a library of standard cells
  - Effective for high-volume production and efficient use of silicon area

Reconfigurable Logic
- Logic blocks
  - To implement combinational and sequential logic
- Interconnect
  - Wires to connect inputs and outputs to logic blocks
- I/O blocks
  - Special logic blocks at periphery of device for external connections
- Key questions:
  - How to make logic blocks programmable? (after chip has been fabbed!)
  - What should the logic granularity be?
  - How to make the wires programmable? (after chip has been fabbed!)
  - Specialized wiring structures for local vs. long distance routes?
  - How many wires per logic block?
[Figure: logic block with n inputs, configuration bits, combinational logic feeding a D flip-flop, and m outputs]

Programmable Array Logic (PAL)
- Based on the fact that any combinational logic can be realized as a sum-of-products
- PALs feature an array of AND-OR gates with programmable interconnect
[Figure: input signals -> AND array (programming of product terms) -> OR array (programming of sum terms) -> output signals]

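To make the sum-of-products idea concrete, here is a minimal Verilog sketch of how one output could be described in AND-array/OR-array form. The module name, signal names, and the choice of a 3-input majority function are illustrative assumptions, not taken from any particular PAL.

```verilog
// Hypothetical example: 3-input majority function written as a sum of products.
module sop_majority_sketch (
    input  wire a,
    input  wire b,
    input  wire c,
    output wire y
);
    // "AND array": each product term combines a chosen subset of the inputs.
    wire p0 = a & b;
    wire p1 = a & c;
    wire p2 = b & c;

    // "OR array": the output is the sum (OR) of the selected product terms.
    assign y = p0 | p1 | p2;
endmodule
```

In a real PAL the choice of which inputs feed each product term is made by programming the array rather than by writing separate expressions.
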
Inside the 22V10 PAL
- Each input pin (and its complement) is sent to the AND array
- The OR gate for each output can take 8-16 product terms, depending on the output pin
- A macrocell block provides additional output flexibility...
[Image removed due to copyright restrictions.]

Cypress PALCE22V10
- Outputs may be registered or combinational, positive or inverted
[Image removed due to copyright restrictions.]
Images courtesy of Lattice Semiconductor Corporation. Used with permission.

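The output options listed above (registered or combinational, positive or inverted) can be pictured with a simplified sketch. This is an assumed, stripped-down model: the names `cfg_registered` and `cfg_invert` are hypothetical stand-ins for programmed configuration bits, and features of the real macrocell such as reset, preset, output enable, and feedback are left out.

```verilog
// Simplified, hypothetical model of a 22V10-style output macrocell.
module macrocell_sketch (
    input  wire clk,
    input  wire sum_term,        // output of this macrocell's OR gate
    input  wire cfg_registered,  // assumed config bit: 1 = registered, 0 = combinational
    input  wire cfg_invert,      // assumed config bit: 1 = inverted, 0 = positive
    output wire pin_out
);
    // Output register; initialized here only so simulation starts from a known value.
    reg q = 1'b0;
    always @(posedge clk)
        q <= sum_term;

    // Choose the registered or combinational path, then optionally invert.
    wire selected = cfg_registered ? q : sum_term;
    assign pin_out = selected ^ cfg_invert;
endmodule
```
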
Anti-Fuse-Based Approach (Actel)
- Rows of programmable logic building blocks + rows of interconnect
- Anti-fuse technology: program once
- Use anti-fuses to build up long wiring runs from short segments
- 8-input, single-output combinational logic blocks
- FFs constructed from discrete cross-coupled gates
[Figure: chip floorplan with rows of logic modules and wiring tracks, surrounded by I/O buffers, programming and test logic]

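Actel Logic Module
- Combinational block does not have the output FF
[Figure: example gate mapping onto the logic module (inputs A, B, C, D, E; output Y; select values 00, 01, 10, 11)]
[Figure: S-R flip-flop built from the logic module (inputs S and R tied with GND and VDD; output Q)]
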
Actel Routing & Programming
- Programming is permanent (one time)
[Figure: routing. Logic module inputs and outputs connect through input segments, output segments, horizontal channels, and long vertical tracks.]
[Figure: programming an antifuse. Labels: precharge phase with tracks at Vpp/2; Vpp and Gnd applied; antifuse shorted.]
Figures courtesy of Actel. Used with permission.

RAM-Based Field Programmable Logic - Xilinx
[Figure: array of Configurable Logic Blocks (CLBs) connected through switch matrices and programmable interconnect, with I/O Blocks (IOBs) at the periphery]
[Figure: IOB detail: pad, input buffer with delay, output buffer with slew rate control, passive pull-up/pull-down, input and output flip-flops]
[Figure: CLB detail: function generators F (inputs F1-F4), G (inputs G1-G4), and H; control inputs C1-C4 (H1, DIN, S/R, EC); clock K; two D flip-flops with S/R control and clock enable (EC); outputs X and Y]
Courtesy of Xilinx. Used with permission.

The Xilinx 4000 CLB
[Figure. Courtesy of Xilinx. Used with permission.]

Two 4-Input Functions, Registered Output, and a Two-Input Function
[Figure. Courtesy of Xilinx. Used with permission.]

5-Input Function, Combinational Output
[Figure. Courtesy of Xilinx. Used with permission.]

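One way to picture a 5-input function in this style of CLB is as two 4-input lookups that share four inputs, with the third function generator acting as a 2:1 selector on the fifth input. The sketch below is an assumed behavioral model of that decomposition, not the actual CLB wiring; the module name, port names, and the `TRUTH` parameter are illustrative.

```verilog
// Hypothetical sketch: a 5-input function split into two 4-input lookups plus a selector.
module func5_sketch #(
    parameter [31:0] TRUTH = 32'h8000_0000  // assumed example: 5-input AND
) (
    input  wire [3:0] x,    // four inputs shared by both 4-input lookups
    input  wire       sel,  // fifth input
    output wire       y
);
    wire [31:0] truth = TRUTH;

    // F-style lookup: the function's value when the fifth input is 0.
    wire f = truth[{1'b0, x}];
    // G-style lookup: the function's value when the fifth input is 1.
    wire g = truth[{1'b1, x}];

    // H-style stage used as a 2:1 multiplexer on the fifth input.
    assign y = sel ? g : f;
endmodule
```

With the default parameter only bit 31 of the table is set, so `y` is 1 only when `sel` and all four bits of `x` are 1.
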
LUT Mapping
- N-LUT: direct implementation of a truth table; any function of N inputs
- An N-LUT requires 2^N storage elements (latches)
- The N inputs select one latch location (like a memory)
- Latches are set by the configuration bitstream
- Why latches and not registers?
[Figure: 4-LUT example, with inputs selecting one of the latches to drive the output. Courtesy of Xilinx. Used with permission.]

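The truth-table view of a LUT can be sketched in a few lines of Verilog. This is a behavioral illustration only: the parameter name `INIT` stands in for the 16 configuration bits that the bitstream would load, and the module and port names are assumptions.

```verilog
// Hypothetical behavioral model of a 4-input LUT.
module lut4_sketch #(
    parameter [15:0] INIT = 16'h8000  // assumed example contents: 4-input AND
) (
    input  wire [3:0] in,
    output wire       out
);
    // 2^4 = 16 stored bits, one per row of the truth table.
    wire [15:0] table_bits = INIT;

    // The inputs act as an address that selects one stored bit.
    assign out = table_bits[in];
endmodule
```

Changing `INIT` changes the implemented function without changing the structure, which is the sense in which the logic block is programmable after fabrication.
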
Configuring the CLB as a RAM
- Memory is built using latches, not FFs
- Read is the same as a LUT function!
[Figure: CLB configured as a 16x2 RAM. Courtesy of Xilinx. Used with permission.]

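A 16x2 memory with a LUT-style read can be sketched as follows. This is an assumed behavioral model: it uses an edge-triggered write for simplicity, whereas the slide notes that the storage itself is latch-based, and all names are illustrative.

```verilog
// Hypothetical behavioral sketch of a 16x2 RAM with a combinational (LUT-like) read.
module ram16x2_sketch (
    input  wire       clk,
    input  wire       we,    // write enable
    input  wire [3:0] addr,
    input  wire [1:0] din,
    output wire [1:0] dout
);
    // 16 locations, 2 bits each.
    reg [1:0] mem [0:15];

    // Write: modeled here as synchronous (a simplifying assumption).
    always @(posedge clk)
        if (we)
            mem[addr] <= din;

    // Read: the address simply selects a stored word, exactly like a LUT lookup.
    assign dout = mem[addr];
endmodule
```
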
Xilinx 4000 Interconnect
[Figure. Courtesy of Xilinx. Used with permission.]

Xilinx 4000 Interconnect Details
- Wires are not ideal!
[Figure. Courtesy of Xilinx. Used with permission.]

Xilinx 4000 Flexible IOB
- Outputs through FF or bypassed
- Adjust transition time
- Adjust the sampling edge
[Figure. Courtesy of Xilinx. Used with permission.]

Add Bells & Whistles
- Hard processor
- Gigabit serial I/O
- Multiplier (18 bit x 18 bit, 36-bit result)
- Programmable termination (VCCIO, Z)
- Impedance control
- BRAM
- Clock management
Courtesy of David B. Parlour, ISSCC 2004 Tutorial, "The Reality and Promise of Reconfigurable Computing in Digital Signal Processing," and Xilinx. Used with permission.

The Virtex II CLB (Half Slice Shown)
[Figure. Courtesy of Xilinx. Used with permission.]

Adder Implementation
- LUT: A ⊕ B
- Y = A ⊕ B ⊕ Cin
- Dedicated carry logic takes Cin and produces Cout
- 1 half-slice = 1-bit adder
[Figure. Courtesy of Xilinx. Used with permission.]

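The split between LUT and dedicated carry logic can be sketched behaviorally as below. This is an assumed model of one adder bit, with illustrative names; it describes the logic functions involved rather than instantiating any vendor primitive.

```verilog
// Hypothetical sketch of one adder bit: LUT computes A xor B, carry logic does the rest.
module adder_bit_sketch (
    input  wire a,
    input  wire b,
    input  wire cin,
    output wire y,
    output wire cout
);
    // LUT: propagate signal, true when exactly one of a, b is 1.
    wire p = a ^ b;

    // Sum: Y = A xor B xor Cin.
    assign y = p ^ cin;

    // Carry: if the inputs differ, pass the incoming carry along;
    // if they are equal, the carry out equals either input.
    assign cout = p ? cin : a;
endmodule
```

Because the carry path is a single multiplexer per bit, a long adder's delay is dominated by this short dedicated path rather than by general-purpose routing.
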
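Carry Chain
- 1 CLB = 4 slices = two 4-bit adders
- 64-bit adder: 16 CLBs
- CLBs must be in the same column
[Figure: A[63:0] + B[63:0] = Y[63:0] built as a column of 4-bit stages: CLB0 takes A[3:0], B[3:0] and produces Y[3:0]; CLB1 takes A[7:4], B[7:4] and produces Y[7:4]; ... CLB15 takes A[63:60], B[63:60] and produces Y[63:60] and the carry out Y[64]. Courtesy of Xilinx. Used with permission.]
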
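The column of 4-bit stages in the figure can be sketched as sixteen 4-bit adders linked by their carries. This is an assumed structural illustration with hypothetical names; a synthesis tool given a plain `a + b` would normally build the chain on its own.

```verilog
// Hypothetical sketch: 64-bit adder built from sixteen chained 4-bit stages.
module adder64_blocks_sketch (
    input  wire [63:0] a,
    input  wire [63:0] b,
    output wire [64:0] y    // y[64] is the final carry out
);
    // carry[k] is the carry into stage k; carry[16] leaves the last stage.
    wire [16:0] carry;
    assign carry[0] = 1'b0;

    genvar k;
    generate
        for (k = 0; k < 16; k = k + 1) begin : stage
            // Each stage adds 4 bits of a and b plus the incoming carry,
            // producing a 4-bit sum and a carry for the next stage.
            assign {carry[k+1], y[4*k +: 4]} =
                   a[4*k +: 4] + b[4*k +: 4] + carry[k];
        end
    endgenerate

    assign y[64] = carry[16];
endmodule
```
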
Virtex II Features
- Double Data Rate registers
- Digital Clock Manager
- Embedded Multiplier
- Block SelectRAM
[Figure. Courtesy of Xilinx. Used with permission.]

The Latest Generation: Virtex-II Pro
- FPGA fabric
- Embedded memories
- Embedded PowerPC
- Hardwired multipliers
- High-speed I/O
[Figure. Courtesy of Xilinx. Used with permission.]

FPGA Evolution Summary [Parlour04]
[Figure: transistor count (x 10^6, axis marked from 0.1 to 1000) versus year (1980-2005). Capabilities labeled on the chart: Logic + FF, Distributed RAM, Arithmetic Support, DSP, System Design Tools, Block RAM, Hard MAC, Hard CPU, High Speed Serial IO. Eras labeled on the chart: Glue Logic, Core Functionality, Logic Platform, System Platform, Domain Specific Platform.]
Courtesy of Xilinx. Used with permission.

Design Flow - Mapping
- Technology Mapping: Schematic/HDL to physical logic units
- Compile functions into basic LUT-based groups (function of target architecture)
[Figure: gates combining a, b, c and b, d feed a D flip-flop; after mapping, a single LUT feeds the D flip-flop]

    always @(posedge Clock or negedge Reset)
    begin
      if (!Reset) q <= 0;
      else q <= (a & b & c) | (b & d);
    end

Design Flow - Placement & Route
- Placement: assign logic location on a particular device
- Routing: iterative process to connect CLB inputs/outputs and IOBs
  - Optimizes critical path delay; can take hours or days for large, dense designs
- Iterate placement if timing is not met
- Once timing is satisfied, generate the bitstream to configure the device
- Challenge! Cannot use full chip for reasonable speeds (wires are not ideal). Typically no more than 50% utilization.
[Figure: LUTs assigned to locations on the device]

Example: Verilog to FPGA

    module adder64 (a, b, sum);
      input [63:0] a, b;
      output [63:0] sum;
      assign sum = a + b;
    endmodule

- Flow: Synthesis, Tech Map, Place & Route
[Figure: 64-bit adder example placed on a Virtex II XC2V2000. Courtesy of Xilinx. Used with permission.]

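Before running a design like this through synthesis, it is common to check it in simulation. The following is a minimal sketch of such a check; the testbench name, the chosen test values, and the messages are assumptions, and the `adder64` module is repeated only so the example is self-contained.

```verilog
// Design under test, repeated from the slide so this example stands alone.
module adder64 (a, b, sum);
  input  [63:0] a, b;
  output [63:0] sum;
  assign sum = a + b;
endmodule

// Hypothetical testbench: applies two input pairs and checks the sums.
module adder64_tb;
  reg  [63:0] a, b;
  wire [63:0] sum;

  adder64 dut (.a(a), .b(b), .sum(sum));

  initial begin
    // Simple case: 1 + 2 should give 3.
    a = 64'd1; b = 64'd2; #1;
    if (sum !== 64'd3) $display("FAIL: 1 + 2");

    // The sum port is 64 bits wide, so the carry out is dropped and
    // all-ones plus one wraps around to zero.
    a = 64'hFFFF_FFFF_FFFF_FFFF; b = 64'd1; #1;
    if (sum !== 64'd0) $display("FAIL: wraparound");

    $display("done");
    $finish;
  end
endmodule
```

The wraparound case is worth noting: as written on the slide, the adder has no carry output, unlike the 65-bit result shown on the carry-chain slide.
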
How are FPGAs Used?
- Prototyping
  - Ensemble of gate arrays used to emulate a circuit to be manufactured
  - Get more/better/faster debugging done than with simulation
- Reconfigurable hardware
  - One hardware block used to implement more than one function
- Special-purpose computation engines
  - Hardware dedicated to solving one problem (or class of problems)
  - Accelerators attached to general-purpose computers (e.g., in a cell phone!)

Summary
- FPGAs provide a flexible platform for implementing digital computing
- A rich set of macros and I/Os is supported (multipliers, block RAMs, ROMs, high-speed I/O)
- A wide range of applications, from prototyping (to validate a design before ASIC mapping) to high-performance spatial computing
- Interconnects are a major bottleneck (physical design and locality are important considerations)

"College students will study concurrent programming instead of C as their first computing experience." -- David B. Parlour, ISSCC 2004 Tutorial