ECE 545 Lecture FPGA Devices & FPGA Tools George Mason University
Required Reading
- Xilinx, Inc., Spartan-3E FPGA Family
  - Module 1: Introduction, Features, Architectural Overview, Package Marking
  - Module 2: Configurable Logic Block (CLB) and Slice Resources, Dedicated Multipliers
Recommended Reading
- Xilinx, Inc., Spartan-3 Generation FPGA User Guide: Extended Spartan-3A, Spartan-3E, and Spartan-3 FPGA Families
Two competing implementation approaches
- ASIC (Application Specific Integrated Circuit)
  - designed all the way from behavioral description to physical layout
  - designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry
- FPGA (Field Programmable Gate Array)
  - no physical layout design; the design ends with a bitstream used to configure a device
  - bought off the shelf and reconfigured by the designers themselves
What is an FPGA?
[Figure: FPGA layout with configurable logic blocks, block RAMs, and I/O blocks]
Which Way to Go?
- ASICs: high performance, low power, low cost in high volumes
- FPGAs: off-the-shelf, low development cost, short time to market, reconfigurability
Other FPGA Advantages
- The manufacturing cycle for an ASIC is very costly and lengthy, and engages a lot of manpower
  - Mistakes not detected at design time have a large impact on development time and cost
  - FPGAs are perfect for rapid prototyping of digital circuits
- Easy upgrades, as in the case of software
- Unique applications: reconfigurable computing
Major FPGA Vendors
- SRAM-based FPGAs
  - Xilinx, Inc.: ~51% of the market
  - Altera Corp.: ~34% of the market
  - (Xilinx and Altera together: ~85%)
  - Lattice Semiconductor
  - Atmel
- Flash & antifuse FPGAs
  - Actel Corp.
  - QuickLogic Corp.
Xilinx
- Primary products: FPGAs and the associated CAD software
  - Programmable Logic Devices
  - ISE Alliance and Foundation Series Design Software
- Main headquarters in San Jose, CA
- Fabless* semiconductor and software company
  - UMC (Taiwan) (*Xilinx acquired an equity stake in UMC in 1996)
  - Seiko Epson (Japan)
  - TSMC (Taiwan)
  - Samsung (Korea)
Xilinx FPGA Families
- Old families: XC3000, XC4000, XC5200
  - Old 0.5 µm, 0.35 µm and 0.25 µm technology. Not recommended for modern designs.
- High-performance families
  - Virtex (220 nm)
  - Virtex-E, Virtex-EM (180 nm)
  - Virtex-II (130 nm)
  - Virtex-II PRO (130 nm)
  - Virtex-4 (90 nm)
  - Virtex-5 (65 nm)
  - Virtex-6 (40 nm)
- Low-cost family
  - Spartan/XL: derived from XC4000
  - Spartan-II: derived from Virtex
  - Spartan-IIE: derived from Virtex-E
  - Spartan-3 (90 nm)
  - Spartan-3E (90 nm): logic optimized
  - Spartan-3A (90 nm): I/O optimized
  - Spartan-3AN (90 nm): non-volatile
  - Spartan-3A DSP (90 nm): DSP optimized
  - Spartan-6 (45 nm)
CLB Structure
General structure of an FPGA
[Figure: programmable logic blocks and programmable interconnect]
Source: The Design Warrior's Guide to FPGAs: Devices, Tools, and Flows. ISBN 0750676043. Copyright 2004 Mentor Graphics Corp. (www.mentor.com)
Xilinx Spartan-3E CLB
[Figure: configurable logic blocks (CLBs), each containing slices, each slice containing logic cells]
CLB Slice = 2 Logic Cells
[Figure: slice with two logic cells; each has a look-up table (inputs G4-G1 or F4-F1), carry and control logic, and a storage element (D, CK, EC, S, R, Q); slice signals include CIN, COUT, CLK, CE, SR, BY, F5IN, X, Y, XB, YB]
Xilinx Multipurpose LUT (MLUT)
[Figure: one multipurpose LUT can act as a 4-input LUT, a 16 x 1 RAM, or a 16-bit shift register]
CLB Slice Structure
Each slice contains two sets of the following:
- Four-input LUT
  - Any 4-input logic function,
  - or 16-bit x 1 sync RAM (SLICEM only),
  - or 16-bit shift register (SLICEM only)
- Carry & Control
  - Fast arithmetic logic
  - Multiplier logic
  - Multiplexer logic
- Storage element
  - Latch or flip-flop
  - Set and reset
  - True or inverted inputs
  - Sync. or async. control
CLB Structure
MLUT as 16x1 ROM
[Figure: multipurpose LUT with its three modes: 4-input LUT, 16 x 1 RAM, 16-bit shift register]
LUT (Look-Up Table) in the Basic ROM Mode
[Figure: a LUT with inputs x1, x2, x3, x4 and output y, shown next to example 16-row truth tables]
- Look-up tables are the primary elements for logic implementation
- Each LUT can implement any function of 4 inputs
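The ROM view of a LUT can be sketched in a few lines of Python. This is an illustrative model only, not Xilinx code: the names `lut4` and `target`, the example function, and the choice of x1 as the most significant address bit are assumptions made for the example.

```python
def lut4(table, x1, x2, x3, x4):
    """Model a 4-input LUT as a 16 x 1 ROM: the inputs form the read address."""
    address = (x1 << 3) | (x2 << 2) | (x3 << 1) | x4
    return table[address]


def target(x1, x2, x3, x4):
    """Example function to implement (chosen arbitrarily for illustration)."""
    return (x1 & x2) | (x3 ^ x4)


# "Configuring" the LUT means storing the function's truth table in its 16 bits.
table = [target((a >> 3) & 1, (a >> 2) & 1, (a >> 1) & 1, a & 1) for a in range(16)]

print(lut4(table, 1, 1, 0, 0))  # the x1 AND x2 term is true
print(lut4(table, 0, 0, 0, 0))  # neither term is true
```

Any other 4-input function fits the same way: only the 16 stored bits change, never the structure.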
5-Input Functions implemented using two LUTs
- One CLB slice can implement any function of 5 inputs
- The logic function is partitioned between two LUTs
- The F5 multiplexer selects the LUT
[Figure: two LUTs (inputs A4-A1) whose outputs feed the F5 multiplexer]
5-Input Functions implemented using two LUTs
[Figure: 32-row truth table of Y over X5, X4, X3, X2, X1, split between two LUTs whose outputs are multiplexed into OUT]
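The partitioning described above is a Shannon expansion on the fifth input. The sketch below is a behavioral illustration under assumed names (`split_function`, `slice_output`, `majority` are hypothetical, and using x5 as the multiplexer select is a modeling choice): each 4-input table holds one half of the 32-row truth table, and the F5 multiplexer picks between them.

```python
from itertools import product


def split_function(f5):
    """Build two 16-entry tables: one for x5 = 0 and one for x5 = 1."""
    low = [f5(0, *bits) for bits in product((0, 1), repeat=4)]
    high = [f5(1, *bits) for bits in product((0, 1), repeat=4)]
    return low, high


def slice_output(low, high, x5, x4, x3, x2, x1):
    """Both LUTs see x4..x1; the F5 multiplexer selects one of them with x5."""
    address = (x4 << 3) | (x3 << 2) | (x2 << 1) | x1
    return high[address] if x5 else low[address]


def majority(x5, x4, x3, x2, x1):
    """Example 5-input function: 1 when at least three inputs are 1."""
    return 1 if (x5 + x4 + x3 + x2 + x1) >= 3 else 0


low, high = split_function(majority)
print(slice_output(low, high, 1, 1, 1, 0, 0))
print(slice_output(low, high, 0, 1, 1, 0, 0))
```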
MLUT as 16x1 RAM
[Figure: multipurpose LUT with its three modes: 4-input LUT, 16 x 1 RAM, 16-bit shift register]
Distributed RAM
- CLB LUT configurable as distributed RAM
  - A single LUT equals a 16x1 RAM
  - Two LUTs implement single- and dual-port RAMs
  - Cascade LUTs to increase RAM size
- Synchronous write
- Synchronous/asynchronous read
  - Accompanying flip-flops used for synchronous read
[Figure: one LUT = RAM16X1S (D, WE, WCLK, A0-A3, O); two LUTs = RAM32X1S (D, WE, WCLK, A0-A4, O), or RAM16X2S (D0, D1, WE, WCLK, A0-A3, O0, O1), or RAM16X1D (D, WE, WCLK, A0-A3, SPO, DPRA0-DPRA3, DPO)]
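A behavioral sketch of the two properties listed above, synchronous write with asynchronous read, and cascading two LUTs into a deeper RAM. The class and method names are hypothetical and only model behavior; they are not the Xilinx primitives themselves.

```python
class Ram16x1:
    """One LUT used as a 16 x 1 RAM."""

    def __init__(self):
        self.cells = [0] * 16

    def read(self, address):
        # Asynchronous read: the output follows the address, no clock needed.
        return self.cells[address & 0xF]

    def clock(self, address, data, write_enable):
        # Synchronous write: the cell changes only on a clock edge with WE high.
        if write_enable:
            self.cells[address & 0xF] = data & 1


class Ram32x1:
    """Two 16 x 1 RAMs cascaded; address bit A4 selects the LUT."""

    def __init__(self):
        self.banks = [Ram16x1(), Ram16x1()]

    def read(self, address):
        return self.banks[(address >> 4) & 1].read(address)

    def clock(self, address, data, write_enable):
        self.banks[(address >> 4) & 1].clock(address, data, write_enable)


ram = Ram32x1()
ram.clock(address=21, data=1, write_enable=1)  # stored in the upper LUT
ram.clock(address=5, data=1, write_enable=0)   # ignored: write enable is low
print(ram.read(21), ram.read(5))
```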
MLUT as 16-bit Shift Register (SRL16)
[Figure: multipurpose LUT with its three modes: 4-input LUT, 16 x 1 RAM, 16-bit shift register]
Shift Register
- Each LUT can be configured as a shift register
  - Serial in, serial out
- Dynamically addressable delay of up to 16 cycles
- For programmable pipeline
- Cascade for greater cycle delays
- Use CLB flip-flops to add depth
[Figure: the LUT drawn as a chain of flip-flops (D, CE, Q) with inputs IN, CE, CLK; DEPTH[3:0] selects the tap that drives OUT]
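The dynamically addressable delay can be modeled as a 16-entry shift chain with a selectable tap. This is a behavioral sketch with hypothetical names (`Srl16`, `clock`, `q`); it assumes that tap address 0 returns the most recently shifted-in bit, so address A corresponds to a delay of A + 1 clock cycles.

```python
class Srl16:
    """Behavioral model of a LUT used as a 16-bit shift register."""

    def __init__(self):
        self.bits = [0] * 16  # bits[0] holds the most recently shifted-in value

    def clock(self, d, ce=1):
        # Serial in: on an enabled clock edge every bit moves one position.
        if ce:
            self.bits = [d & 1] + self.bits[:-1]

    def q(self, depth):
        # Serial out: the 4-bit depth input selects which tap is visible.
        return self.bits[depth & 0xF]


srl = Srl16()
for bit in (1, 0, 0, 0):
    srl.clock(bit)

print(srl.q(3))  # the 1 shifted in four clock edges ago
print(srl.q(0))  # the most recent bit
```

Changing `depth` at run time changes the delay without reloading the register, which is what makes the structure useful for programmable pipelines.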
Using Multipurpose Look-Up Tables in the Shift Register Mode (SRL16)
Inferred from a behavioral description in VHDL for shift registers with
- one serial input, one serial output
- no reset, no set
Cascading LUT Shift Registers into Shift Registers Longer than 16 bits
Shift Register
- Register-rich FPGA
  - Allows for addition of pipeline stages to increase throughput
- Data paths must be balanced to keep desired functionality
[Figure: two parallel 64-bit data paths. One passes through Operation A (4 cycles) and Operation B (8 cycles), 12 cycles in total; the other passes through Operation C (3 cycles), leaving a 9-cycle imbalance]
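The balancing arithmetic behind this slide can be sketched as follows. The function name `balancing_cost` is hypothetical, and the resource estimate assumes that one LUT in shift-register mode delays one bit by up to 16 cycles; it is a rough estimate, not a tool report.

```python
import math


def balancing_cost(long_path_cycles, short_path_cycles, width_bits, srl_depth=16):
    """Return (imbalance in cycles, LUTs needed to delay the shorter path)."""
    imbalance = long_path_cycles - short_path_cycles
    luts_per_bit = math.ceil(imbalance / srl_depth) if imbalance > 0 else 0
    return imbalance, luts_per_bit * width_bits


# Operation A (4 cycles) + Operation B (8 cycles) versus Operation C (3 cycles),
# on 64-bit data paths.
imbalance, luts = balancing_cost(4 + 8, 3, 64)
print(imbalance, luts)

# For comparison: the same delay built from plain flip-flops.
print(imbalance * 64)
```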
Logic Cell = ½ of a CLB Slice
CLB Slice = 2 Logic Cells
Carry & Control Logic
[Figure: slice diagram with the carry and control logic blocks placed between each look-up table and its storage element, chained from CIN to COUT]
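Full-adder
[Figure: FA block with inputs x, y, c_in and outputs c_out, s]
$x + y + c_{in} = (c_{out}\, s)_2$
    x  y  c_in | c_out  s
    0  0  0    | 0      0
    0  0  1    | 0      1
    0  1  0    | 0      1
    0  1  1    | 1      0
    1  0  0    | 0      1
    1  0  1    | 1      0
    1  1  0    | 1      0
    1  1  1    | 1      1
Full-adder: alternative implementations
    x  y | c_out  s
    0  0 | 0      c_in
    0  1 | c_in   not c_in
    1  0 | c_in   not c_in
    1  1 | 1      c_in
Full-adder: alternative implementations
Implementation used to generate fast carry logic in Xilinx FPGAs
    x  y | c_out
    0  0 | y
    0  1 | c_in
    1  0 | c_in
    1  1 | y
$p = x \oplus y$, $g = y$
$s = p \oplus c_{in} = x \oplus y \oplus c_{in}$
[Figure: the LUT (inputs A1, A2) computes p; a carry multiplexer passes g when p = 0 and C_in when p = 1 to produce C_out; an XOR gate combines p and C_in into S]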
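The multiplexer-based carry scheme can be checked against ordinary addition with a short behavioral model. The function names are hypothetical; the structure (propagate from the LUT, a carry multiplexer choosing between the generate input and the incoming carry, and a dedicated XOR for the sum) follows the slide.

```python
def full_adder(x, y, c_in):
    """One bit of the fast-carry adder: returns (c_out, s)."""
    p = x ^ y                    # propagate, computed in the LUT
    g = y                        # generate input of the carry multiplexer
    c_out = c_in if p else g     # carry multiplexer, selected by p
    s = p ^ c_in                 # dedicated XOR gate
    return c_out, s


def ripple_add(a, b, width):
    """Chain full adders from LSB to MSB; returns (sum, carry out)."""
    carry, total = 0, 0
    for i in range(width):
        carry, s = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry


print(ripple_add(13, 7, 4))  # 4-bit addition of 13 and 7
```

When p = 0 the two operand bits are equal, so either of them equals the carry out; that is why the multiplexer can use y directly as g.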
Carry & Control Logic in Spartan 3 FPGAs
[Figure: the LUT feeding the hardwired (fast) carry logic]
Critical Path for an Adder Implemented Using Xilinx Spartan 3/Spartan 3E FPGAs
- Bottom operand input to carry out delay, T_OPCYF: 0.9 ns for Spartan 3
- Carry propagation delay, T_BYP: 0.2 ns for Spartan 3
- Carry input to top sum combinational output delay, T_CINY: 1.2 ns for Spartan 3
Fast Carry Logic
- Each CLB contains separate logic and routing for the fast generation of sum & carry signals
  - Increases efficiency and performance of adders, subtractors, accumulators, comparators, and counters
- Carry logic is independent of normal logic and routing resources
[Figure: carry logic routing running from LSB to MSB]
Accessing Carry Logic
- All major synthesis tools can infer carry logic for arithmetic functions
  - Addition (SUM <= A + B)
  - Subtraction (DIFF <= A - B)
  - Comparators (if A < B then ...)
  - Counters (count <= count + 1)
Embedded Multipliers
RAM Blocks and Multipliers in Xilinx FPGAs
[Figure: RAM blocks, multipliers, and logic blocks]
Combinational and Registered Multiplier
Dedicated Multiplier Block
Interface of a Dedicated Multiplier
FPGA Block RAM
Block RAM
[Figure: Spartan-3 dual-port block RAM with Port A and Port B]
- Most efficient memory implementation
  - Dedicated blocks of memory
- Ideal for most memory requirements
  - 4 to 36 memory blocks in Spartan 3E
  - 18 kbits = 18,432 bits per block (16 k without parity bits)
  - Use multiple blocks for larger memories
- Builds both single and true dual-port RAMs
- Synchronous write and read (different from distributed RAM)
Block RAM can have various configurations (port aspect ratios)
- 16k x 1 (addresses 0 to 16,383)
- 8k x 2 (addresses 0 to 8,191)
- 4k x 4 (addresses 0 to 4,095)
- 2k x (8+1) (addresses 0 to 2,047)
- 1024 x (16+2) (addresses 0 to 1,023)
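A quick arithmetic check of these aspect ratios: every configuration addresses the same 16 Kbit of data, and the wider ones add parity bits up to the full 18,432 bits. The tuple layout and the helper name `describe` are assumptions made for this sketch.

```python
# (depth, data width, parity width) for each port aspect ratio listed above
ASPECT_RATIOS = [
    (16384, 1, 0),
    (8192, 2, 0),
    (4096, 4, 0),
    (2048, 8, 1),
    (1024, 16, 2),
]


def describe(depth, data_width, parity_width):
    """Derive address width and capacity for one configuration."""
    return {
        "address_bits": (depth - 1).bit_length(),
        "data_bits": depth * data_width,
        "total_bits": depth * (data_width + parity_width),
    }


for config in ASPECT_RATIOS:
    print(config, describe(*config))
```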
Block RAM Port Aspect Ratios
Single-Port Block RAM
[Figure: data ports DI[w-p-1:0] and DO[w-p-1:0]]
Dual-Port Block RAM
[Figure: port A data ports DIA[w_A-p_A-1:0] and DOA[w_A-p_A-1:0]; port B data ports DIB[w_B-p_B-1:0] and DOB[w_B-p_B-1:0]]
Input/Output Blocks (IOBs)
Basic I/O Block Structure
[Figure: IOB with a three-state path (three-state FF enable, three-state control), an output path (output FF enable), and an input path (direct input, registered input, input FF enable); the flip-flops (D, EC, SR, Q) share clock and set/reset]
IOB Functionality
- The IOB provides the interface between the package pins and the CLBs
- Each IOB can work as a uni- or bi-directional I/O
- Outputs can be forced into high impedance
- Inputs and outputs can be registered
  - advised for high-performance I/O
- Inputs can be delayed
Spartan-3E Family Attributes
Spartan-3E FPGA Family Members
FPGA Nomenclature
FPGA Design Flow
Design flow (1)
- Specification (Lab Experiments)
    Design and implement a simple unit that speeds up encryption with an RC5-like cipher with a fixed key set on an 8031 microcontroller. Unlike in Experiment 5, this time your unit has to be able to perform the encryption algorithm by itself, executing 32 rounds.
- VHDL description (Your Source Files)
    library IEEE;
    use ieee.std_logic_1164.all;
    use ieee.std_logic_unsigned.all;

    entity RC5_core is
      port(
        clock, reset, encr_decr: in  std_logic;
        data_input:  in  std_logic_vector(31 downto 0);
        data_output: out std_logic_vector(31 downto 0);
        out_full:    in  std_logic;
        key_input:   in  std_logic_vector(31 downto 0);
        key_read:    out std_logic  -- no semicolon after the last port
      );
    end RC5_core;
- Functional simulation
- Synthesis
- Post-synthesis simulation
Design flow (2)
- Implementation
- Timing simulation
- Configuration
- On-chip testing
Tools used in FPGA Design Flow
- Design: VHDL code, leading to functionally verified VHDL code
- Synthesis: Synplicity Synplify Pro or Xilinx XST, producing a netlist
- Implementation: Xilinx ISE, producing a bitstream
Synthesis
Synthesis Tools
- Synplify Pro
- Xilinx XST
- and others
Logic Synthesis
VHDL description to circuit netlist
    architecture MLU_DATAFLOW of MLU is
      signal A1: STD_LOGIC;
      signal B1: STD_LOGIC;
      signal Y1: STD_LOGIC;
      signal MUX_0, MUX_1, MUX_2, MUX_3: STD_LOGIC;
    begin
      -- optional inversion of the inputs and of the output
      A1 <= A when (NEG_A='0') else not A;
      B1 <= B when (NEG_B='0') else not B;
      Y  <= Y1 when (NEG_Y='0') else not Y1;

      -- the four candidate logic operations
      MUX_0 <= A1 and B1;
      MUX_1 <= A1 or B1;
      MUX_2 <= A1 xor B1;
      MUX_3 <= A1 xnor B1;

      -- 4-to-1 multiplexer selected by L1, L0
      with (L1 & L0) select
        Y1 <= MUX_0 when "00",
              MUX_1 when "01",
              MUX_2 when "10",
              MUX_3 when others;
    end MLU_DATAFLOW;
Circuit netlist (RTL view)
Mapping
[Figure: netlist mapped onto look-up tables LUT0 to LUT5 and flip-flops FF1 and FF2]
Implementation
Implementation
- After synthesis, the entire implementation process is performed by FPGA vendor tools
Translation
- Inputs from synthesis:
  - Circuit netlist: EDIF (Electronic Design Interchange Format)
  - Timing constraints: NCF (Native Constraint File)
- Input from the Constraint Editor or a text editor: UCF (User Constraint File)
- Output of translation: NGD (Native Generic Database) file
Pin Assignment
[Figure: ports of the LAB5 design (CLOCK, RESET, CONTROL(0) to CONTROL(2), SEGMENTS(0) to SEGMENTS(6)) assigned to FPGA package pins]
Example of a UCF File
    NET "CLOCK" LOC = "P0";
    NET "reset" LOC = "B0";
    NET "S_SEG0<6>" LOC = "H";
    NET "S_SEG0<5>" LOC = "G4";
    NET "S_SEG0<4>" LOC = "G5";
    NET "S_SEG0<3>" LOC = "H5";
    NET "S_SEG0<2>" LOC = "H6";
    NET "S_SEG0<1>" LOC = "H3";
    NET "S_SEG0<0>" LOC = "H2";
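Pin constraints of this form are easy to sanity-check before running the tools. The sketch below is a hypothetical helper, not part of any Xilinx tool: it handles only the simple `NET "name" LOC = "pin";` form shown above, and the net and pin names in the sample text are made up for illustration.

```python
import re

# Matches only the simple form: NET "name" LOC = "pin";
LOC_RE = re.compile(r'NET\s+"([^"]+)"\s+LOC\s*=\s*"([^"]+)"\s*;')


def parse_loc_constraints(text):
    """Return a dict mapping each net name to its assigned pin."""
    return {net: pin for net, pin in LOC_RE.findall(text)}


def pins_used_twice(assignments):
    """Return the sorted list of pins assigned to more than one net."""
    seen, repeated = set(), set()
    for pin in assignments.values():
        if pin in seen:
            repeated.add(pin)
        seen.add(pin)
    return sorted(repeated)


SAMPLE = '''
NET "clk"    LOC = "A1";
NET "rst"    LOC = "A2";
NET "led<0>" LOC = "B7";
NET "led<1>" LOC = "B7";
'''

assignments = parse_loc_constraints(SAMPLE)
print(assignments)
print(pins_used_twice(assignments))
```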
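Mapping
[Figure: netlist mapped onto look-up tables LUT0 to LUT5 and flip-flops FF1 and FF2]
Placing
[Figure: mapped logic placed into CLB slices of the FPGA]
Routing
[Figure: placed slices connected through the programmable connections of the FPGA]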
Configuration
- Once a design is implemented, you must create a file that the FPGA can understand
  - This file is called a bitstream: a BIT file (.bit extension)
- The BIT file can be downloaded directly to the FPGA, or can be converted into a PROM file, which stores the programming information
Two main stages of the FPGA Design Flow
Synthesis
- RTL Synthesis (technology independent)
  - Code analysis
  - Derivation of main logic constructions
  - Technology independent optimization
  - Creation of RTL View
- Map (technology dependent)
  - Mapping of extracted logic structures to device primitives
  - Technology dependent optimization
  - Application of synthesis constraints
  - Netlist generation
  - Creation of Technology View
Implementation (technology dependent)
- Place & Route
  - Placement of generated netlist onto the device
  - Choosing best interconnect structure for the placed design
  - Application of physical constraints
- Configure
  - Bitstream generation
  - Burning device
Report files
Map report header
    Xilinx Mapping Report File for Design 'Lab3Demo'

    Design Information
    ------------------
    Command Line   : c:\xilinx\bin\nt\map.exe -p 3S1500FG320-4 -o map.ncd -pr b -k 4 -cm area -c 100 Lab3Demo.ngd Lab3Demo.pcf
    Target Device  : xc3s1500
    Target Package : fg320
    Target Speed   : -4
    Mapper Version : spartan3 -- $Revision: 1.34 $
Map report
    Design Summary
    --------------
    Number of errors:      0
    Number of warnings:    0
    Logic Utilization:
      Number of Slice Flip Flops:         30 out of  26,624    1%
      Number of 4 input LUTs:             38 out of  26,624    1%
    Logic Distribution:
      Number of occupied Slices:          33 out of  13,312    1%
      Number of Slices containing only related logic:   33 out of  33  100%
      Number of Slices containing unrelated logic:       0 out of  33    0%
        *See NOTES below for an explanation of the effects of unrelated logic
    Total Number 4 input LUTs:            62 out of  26,624    1%
      Number used as logic:               38
      Number used as a route-thru:        24
    Number of bonded IOBs:                10 out of     221    4%
      IOB Flip Flops:                      7
    Number of GCLKs:                       1 out of       8   12%
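The LUT figures in the summary can be cross-checked by hand: the total is the sum of LUTs used as logic and LUTs used as route-thrus, and the utilization is that total divided by the LUTs available. A minimal sketch, with a hypothetical helper name and exact (unrounded) percentages rather than the rounded values the report prints:

```python
def utilization(used, available):
    """Exact utilization in percent."""
    return 100.0 * used / available


logic_luts = 38          # "Number used as logic"
route_thru_luts = 24     # "Number used as a route-thru"
available_luts = 26624   # 4-input LUTs in the target device

total_luts = logic_luts + route_thru_luts
print(total_luts)
print(f"{utilization(total_luts, available_luts):.2f}%")
```

The design uses well under one percent of the device, which is typical for a small lab exercise on a mid-size FPGA.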
Place & route report
    Asterisk (*) preceding a constraint indicates it was not met.
    This may be due to a setup or hold violation.

    --------------------------------------------------------------------------------------------
      Constraint                              | Requested | Actual  | Logic  | Absolute | Number of
                                              |           |         | Levels | Slack    | errors
    --------------------------------------------------------------------------------------------
    * TS_CLOCK = PERIOD TIMEGRP "CLOCK" 5 ns  | 5.000ns   | 5.140ns | 4      | -0.140ns | 5
      HIGH 50%                                |           |         |        |          |
    --------------------------------------------------------------------------------------------
      TS_genHz_ClockHz = PERIOD TIMEGRP       | 5.000ns   | 4.137ns | 2      | 0.863ns  | 0
      "genhz_clockhz" 5 ns HIGH 50%           |           |         |        |          |
    --------------------------------------------------------------------------------------------
Post layout timing report
    Clock to Setup on destination clock CLOCK
    ---------------+---------+---------+---------+---------+
                   | Src:Rise| Src:Fall| Src:Rise| Src:Fall|
    Source Clock   |Dest:Rise|Dest:Rise|Dest:Fall|Dest:Fall|
    ---------------+---------+---------+---------+---------+
    CLOCK          |    5.140|         |         |         |
    ---------------+---------+---------+---------+---------+

    Timing summary:
    ---------------
    Timing errors: 9  Score: 543
    Constraints cover 574 paths, 0 nets, and 87 connections

    Design statistics:
       Minimum period: 5.140ns (Maximum frequency: 194.553MHz)
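The quantities in these timing reports are related by two simple formulas: slack is the requested period minus the achieved period, and the maximum frequency is the reciprocal of the minimum period. The sketch below uses hypothetical helper names and round, made-up numbers rather than values from the reports.

```python
def slack_ns(requested_period_ns, actual_period_ns):
    """Positive slack means the constraint is met; negative means it is violated."""
    return requested_period_ns - actual_period_ns


def max_frequency_mhz(min_period_ns):
    """Convert a minimum clock period in ns to a maximum frequency in MHz."""
    return 1000.0 / min_period_ns


# Illustrative values only: a 10 ns constraint met with an 8 ns critical path.
print(slack_ns(10.0, 8.0))
print(max_frequency_mhz(8.0))

# A constraint that is not met shows up as negative slack.
print(slack_ns(5.0, 5.5) < 0)
```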
Xilinx FPGA Devices
    Technology  | Low-cost   | High-performance
    120/150 nm  |            | Virtex 2, 2 Pro
    90 nm       | Spartan 3  | Virtex 4
    65 nm       |            | Virtex 5
    45 nm       | Spartan 6  |
    40 nm       |            | Virtex 6
Altera FPGA Devices
    Technology  | Low-cost     | Mid-range  | High-performance
    130 nm      | Cyclone      |            | Stratix
    90 nm       | Cyclone II   |            | Stratix II
    65 nm       | Cyclone III  | Arria I    | Stratix III
    40 nm       | Cyclone IV   | Arria II   | Stratix IV
RTL view in Synplify Pro
- General logic structures can be recognized in the RTL view
[Figure: RTL view with a comparator, an incrementer, and a MUX labeled]
Crossprobing between RTL view and code
- Each port, net, or block can be chosen by a mouse click from the browser or directly from the RTL View
- Double-clicking an element shows its source code
- Reverse crossprobing is also possible: if a section of code is marked, the corresponding element of the RTL View is marked too
Technology View in Synplify Pro
- Technology view is a mapped RTL view. It can be opened by pressing the corresponding toolbar button or by double-clicking the .srm file
- As in the RTL View, the same toolbar buttons can be used here
- Two additional buttons are enabled:
  - show critical path
  - open timing analyst
- Note: the technology view is usually large and is presented on a number of sheets
[Figure: ports, nets and blocks browser; the technology view is presented using device primitives]
Viewing critical path
- The critical path can be viewed by pressing the "show critical path" button
- Delay values are written near each component of the path
Timing Analyst
- The timing analyst is opened by pressing the "open timing analyst" button
- The timing analyst makes it possible to analyze different paths in the design
- The timing analyst can be opened only from the Technology View