CDA 4253 FPGA System Design FPGA Architectures. Hao Zheng Dept of Comp Sci & Eng U of South Florida


FPGAs: Generic Architecture. FPGAs also include common fixed logic blocks for higher performance: on-chip memory, DSP/multiplier blocks, fast arithmetic logic, microprocessors, and communication logic.

Programming Technologies

Programming Technologies: Fuses


Programming Technologies: Anti-fuses

Programming Technologies: FLASH

Programming Technologies: SRAM. [Figure: an SRAM cell controlling a transistor switch; the stored bit (1 or 0) determines whether the switch is open or closed.]

Static RAM Cell

Basic Logic Elements (BLEs): the basic component that can be programmed to implement logic functions and provide storage.

Lookup Tables (LUTs). [Figure: a 2-input LUT with inputs x, y selecting one of four SRAM cells at addresses 00, 01, 10, 11.]
Commercial FPGAs: Xilinx: 6-LUT. Altera: 6-LUT. Microsemi: 4-LUT.
An x-input LUT can be programmed into one of $2^{2^x}$ functions.
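The function count above follows from the LUT holding one SRAM bit per truth-table row. A minimal sketch of that arithmetic (the helper name `num_lut_functions` is made up for illustration):

```python
def num_lut_functions(x: int) -> int:
    """Number of distinct Boolean functions an x-input, 1-output LUT can hold."""
    rows = 2 ** x        # one SRAM bit per truth-table row
    return 2 ** rows     # each SRAM bit is independently 0 or 1


for x in (2, 4, 6):
    print(f"{x}-LUT: {num_lut_functions(x)} functions")
```

For a 2-LUT this gives 16 functions, which is why the AND, OR, NAND, and NOR slides only differ in their four SRAM bits.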

LUT = Programmable Truth Table. Also called a function generator.
x y | z
0 0 | A
0 1 | B
1 0 | C
1 1 | D
The SRAM cells at addresses 00, 01, 10, 11 hold A, B, C, D; inputs x and y select which cell drives z.

AND: LUT contents at addresses xy = 00, 01, 10, 11 are 0, 0, 0, 1.

OR: LUT contents at addresses xy = 00, 01, 10, 11 are 0, 1, 1, 1.

NAND: LUT contents at addresses xy = 00, 01, 10, 11 are 1, 1, 1, 0.

NOR: LUT contents at addresses xy = 00, 01, 10, 11 are 1, 0, 0, 0.
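The four gates above can be checked with a small software model of a 2-input LUT. This is only a sketch: `lut2` and the list-based configuration are illustrative choices, not a vendor API.

```python
def lut2(config, x, y):
    """2-input LUT: config lists the SRAM bits for addresses 00, 01, 10, 11."""
    return config[(x << 1) | y]


# SRAM contents taken from the AND, OR, NAND, and NOR truth tables.
AND = [0, 0, 0, 1]
OR = [0, 1, 1, 1]
NAND = [1, 1, 1, 0]
NOR = [1, 0, 0, 0]

for name, config in (("AND", AND), ("OR", OR), ("NAND", NAND), ("NOR", NOR)):
    outputs = [lut2(config, x, y) for x in (0, 1) for y in (0, 1)]
    print(name, outputs)
```

The same `lut2` hardware implements every gate; only the four configuration bits change.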

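XOR and XNOR: [Figure: two empty 2-input LUTs with inputs x, y and output z; the SRAM contents at addresses 00, 01, 10, 11 are left blank.]

z = y and z = y + x: [Figure: two empty 2-input LUTs with inputs x, y and output z; the SRAM contents at addresses 00, 01, 10, 11 are left blank.]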

Features of LUTs: A LUT is a piece of RAM. It can be configured as distributed RAM in Xilinx devices. It can be configured as a shift register. An n-LUT can implement any n-input logic function. Logic minimization should reduce the number of inputs, not the number of logical operators. All logic functions implemented by an n-LUT have the same propagation delay.

Look-up-tables (LUTs): Why aren't FPGAs just a big LUT?
The size of a truth table grows exponentially with the number of inputs: 3 inputs = 8 rows, 4 inputs = 16 rows, 5 inputs = 32 rows, etc.
A LUT has the same number of rows as its truth table, so LUTs also grow exponentially with the number of inputs.
Number of SRAM bits in a LUT = $2^i \times o$, where i = # of inputs and o = # of outputs.
Example: 64-input combinational logic with 1 output would require $2^{64} \approx 1.84 \times 10^{19}$ SRAM bits. Clearly, it is not feasible to use large LUTs.
So, how do FPGAs implement logic with many inputs?
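The growth described above is easy to reproduce. A short sketch, where `lut_sram_bits` is a hypothetical helper name:

```python
def lut_sram_bits(i: int, o: int = 1) -> int:
    """SRAM bits needed by a LUT with i inputs and o outputs: 2^i * o."""
    return (2 ** i) * o


for inputs in (3, 4, 5, 6):
    print(f"{inputs} inputs, 1 output: {lut_sram_bits(inputs)} bits")

big = lut_sram_bits(64)
print(f"64 inputs, 1 output: {big} bits (about {big:.2e})")
```

A 6-LUT needs only 64 bits, while a single 64-input LUT would need about 1.84e19 bits, which is the gap that motivates mapping logic onto many small LUTs.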

Look-up-tables (LUTs): Map circuits onto multiple LUTs. Divide the circuit into smaller circuits that fit in LUTs (same # of inputs and outputs). Example: 2-input LUTs.
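As a rough illustration of this mapping, the sketch below splits a 4-input function across three 2-input LUTs and checks it against the direct expression. The function `(a AND b) OR (c XOR d)` and all names are chosen only for this example; a real technology mapper derives the split automatically.

```python
from itertools import product


def lut2(config, x, y):
    """2-input LUT: config lists the SRAM bits for addresses 00, 01, 10, 11."""
    return config[(x << 1) | y]


AND2 = [0, 0, 0, 1]
OR2 = [0, 1, 1, 1]
XOR2 = [0, 1, 1, 0]


def f_direct(a, b, c, d):
    """Reference 4-input function, too wide for a single 2-LUT."""
    return (a & b) | (c ^ d)


def f_mapped(a, b, c, d):
    """Same function built from three 2-LUTs connected in two levels."""
    n1 = lut2(AND2, a, b)     # first-level LUT
    n2 = lut2(XOR2, c, d)     # first-level LUT
    return lut2(OR2, n1, n2)  # second-level LUT combines the partial results


mismatches = [bits for bits in product((0, 1), repeat=4)
              if f_direct(*bits) != f_mapped(*bits)]
print(mismatches)  # → []
```

The three 2-LUTs use 12 SRAM bits in total, compared with 16 bits for one 4-input LUT, at the cost of an extra level of delay.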

Configurable Logic Blocks: A number of BLEs are grouped with a local network in order to implement functions with a large number of inputs and multiple outputs.

Configurable Logic Blocks (CLBs). Example: ripple-carry adder. Each LUT implements 1 full adder. Use efficient connections between LUTs for carry signals.
[Figure: a CLB with two 3-input, 2-output LUTs, four FFs, and 2x1 muxes; inputs A(0), B(0), Cin(0) and A(1), B(1), Cin(1); outputs S(0), Cout(0) and S(1), Cout(1).]
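The ripple-carry example can be modeled in software by treating each 3-input, 2-output LUT as two 8-entry tables, one for the sum and one for the carry. This is a behavioral sketch with illustrative names, not a description of the actual carry-chain circuitry.

```python
# SRAM contents of a 3-in, 2-out LUT, addressed by (a, b, cin) as a 3-bit number.
SUM_BITS = [0, 1, 1, 0, 1, 0, 0, 1]
CARRY_BITS = [0, 0, 0, 1, 0, 1, 1, 1]


def full_adder_lut(a, b, cin):
    """One full adder implemented as a table lookup."""
    addr = (a << 2) | (b << 1) | cin
    return SUM_BITS[addr], CARRY_BITS[addr]


def ripple_carry_add(a, b, width):
    """Chain `width` full-adder LUTs; the carry ripples from bit 0 upward."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder_lut((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    return result, carry


print(ripple_carry_add(0b1011, 0b0110, 4))  # → (1, 1), i.e. 11 + 6 = 17 = 0b1_0001
```

The loop makes the critical path visible: bit i cannot finish before the carry from bit i-1 arrives, which is why the connections between LUTs for carry signals need to be fast.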

Programmable Interconnect

FPGA Routing Architectures: Must be flexible to accommodate various circuit implementations.

Connection Boxes: programmable switches (SRAM)

Switch Boxes SRAM cell

Segmented Routing: Short wires: many, local connections. Long wires: few, low latency, carrying global signals. Dedicated long wires for clock/reset signals.

Hierarchical Routing Architecture: Most designs display locality of connections, which motivates a hierarchical routing architecture.

FPGA Configuration

Configuration Comes at a Cost. [Figure labels: 1T; 6T SRAM; 4-6T SRAM; 4T SRAM.] Additional overhead: configuration circuitry, error detection/correction, security features. Source: https://en.wikipedia.org/wiki/Static_random-access_memory

FPGA Design Flow

FPGA CAD Flow. Input: a circuit (netlist). Output: FPGA configuration bitstream. Main (algorithmic) stages: logic synthesis/optimization, technology mapping, packing/placement, routing, bitstream generation.

Xilinx FPGA Architecture DS099-1_01_032703

Xilinx 7-Series FPGA Architecture. [Figure labels: precise, low-jitter clocking; on-chip block RAM; high-performance serial I/O connectivity with transceiver technology; logic fabric; DSP slices.]

Xilinx Artix-7: Low-end 7-series FPGA manufactured in a 28 nm process. Based on 6-input LUTs, configurable as distributed memory. Supports DDR3 memory interfaces. High-speed serial interfaces supporting multi-gigabit communications. On-chip DSPs, multipliers, and block RAMs. Clock management tiles to provide high-precision, low-jitter clock signals.

Xilinx Artix-7
Device   | Logic Cells | CLB Slices | Max Distributed RAM (Kb) | DSP48E1 Slices | Block RAM 18 Kb | Block RAM 36 Kb | Block RAM Max (Kb)
XC7A15T  | 16,640      | 2,600      | 200                      | 45             | 50              | 25              | 900
XC7A35T  | 33,280      | 5,200      | 400                      | 90             | 100             | 50              | 1,800
XC7A50T  | 52,160      | 8,150      | 600                      | 120            | 150             | 75              | 2,700
XC7A75T  | 75,520      | 11,800     | 892                      | 180            | 210             | 105             | 3,780
XC7A100T | 101,440     | 15,850     | 1,188                    | 240            | 270             | 135             | 4,860
XC7A200T | 215,360     | 33,650     | 2,888                    | 740            | 730             | 365             | 13,140

Xilinx Artix-7
Device   | Logic Cells | CLB Slices | Max Distributed RAM (Kb) | DSP48E1 Slices | Block RAM 18 Kb | Block RAM 36 Kb | CMTs | PCIe | GTPs | XADC Blocks | Total I/O Banks | Max User I/O
XC7A15T  | 16,640      | 2,600      | 200                      | 45             | 50              | 25              | 5    | 1    | 4    | 1           | 5               | 250
XC7A35T  | 33,280      | 5,200      | 400                      | 90             | 100             | 50              | 5    | 1    | 4    | 1           | 5               | 250
XC7A50T  | 52,160      | 8,150      | 600                      | 120            | 150             | 75              | 5    | 1    | 4    | 1           | 5               | 250
XC7A75T  | 75,520      | 11,800     | 892                      | 180            | 210             | 105             | 6    | 1    | 8    | 1           | 6               | 300
XC7A100T | 101,440     | 15,850     | 1,188                    | 240            | 270             | 135             | 6    | 1    | 8    | 1           | 6               | 300
XC7A200T | 215,360     | 33,650     | 2,888                    | 740            | 730             | 365             | 10   | 1    | 16   | 1           | 10              | 500

Xilinx Artix-7 CLBs. Each CLB provides 8 6-LUTs, 16 FFs, 2 carry chains, 256b distributed RAM, and a 128b shift register. The abundant FFs can be used to improve design performance with pipelining.
[Figure UG474_c1_01_071910: a CLB containing Slice(0) and Slice(1), connected to a switch matrix, with CIN and COUT carry ports on each slice.]

Xilinx Artix-7 CLBs: Slice Architecture. Each slice contains 4 6-LUTs, 8 FFs, and carry logic for fast addition.
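The per-slice figures (4 6-LUTs and 8 FFs) can be combined with the slice counts listed for the Artix-7 devices to estimate total LUT and FF counts. The sketch below is plain arithmetic over those numbers; the dictionary and function names are made up for illustration.

```python
# Slice counts per device, as listed in the Artix-7 device table.
SLICES = {
    "XC7A15T": 2_600,
    "XC7A35T": 5_200,
    "XC7A50T": 8_150,
    "XC7A75T": 11_800,
    "XC7A100T": 15_850,
    "XC7A200T": 33_650,
}

LUTS_PER_SLICE = 4  # 6-input LUTs per slice
FFS_PER_SLICE = 8   # flip-flops per slice


def slice_resources(slices: int) -> tuple[int, int]:
    """Return (total 6-LUTs, total FFs) for a given number of slices."""
    return slices * LUTS_PER_SLICE, slices * FFS_PER_SLICE


for device, slices in SLICES.items():
    luts, ffs = slice_resources(slices)
    print(f"{device}: {luts:,} LUTs, {ffs:,} FFs")
```

For the XC7A100T this gives 63,400 LUTs and 126,800 FFs, two FFs for every LUT, which is the abundance of registers that makes pipelining cheap.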

Xilinx Artix-7 CLBs: Slice Architecture. Wide-function MUXs implement functions with up to 8 inputs.
[Figure WP405_06_013012: four LUTs in a slice combined in pairs by two F7 MUXs, whose outputs are combined by an F8 MUX.]

Xilinx Artix-7 CLBs: 6-LUTs. Each 6-LUT implements any 6-input function, or two 5-input functions with shared inputs.
[Figure: a 6-input LUT with outputs O6 and O5, each able to feed a register with D, Q, CE, CLK, and S/R ports.]
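The dual-output behavior can be sketched with a simplified software model. The model assumes that the lower 32 SRAM bits drive O5 and that, when the sixth input is held high, O6 reads the upper 32 bits; this assumption and all names here are illustrative and should be checked against the vendor CLB documentation.

```python
def make_init(f, g):
    """Pack two 5-input functions into one 64-bit LUT configuration.

    f fills the lower 32 bits (read on O5), g fills the upper 32 bits
    (read on O6 when the sixth input is 1). Assumed layout, for illustration.
    """
    init = 0
    for addr in range(32):
        bits = [(addr >> k) & 1 for k in range(5)]
        init |= f(*bits) << addr
        init |= g(*bits) << (32 + addr)
    return init


def lut6(init, addr6):
    """Return (O5, O6) for a 6-bit address; bit 5 of addr6 is the sixth input."""
    o6 = (init >> addr6) & 1
    o5 = (init >> (addr6 & 0x1F)) & 1
    return o5, o6


def parity5(a, b, c, d, e):
    return a ^ b ^ c ^ d ^ e


def and5(a, b, c, d, e):
    return a & b & c & d & e


init = make_init(parity5, and5)
shared_inputs = 0b10110                         # five inputs shared by both functions
print(lut6(init, (1 << 5) | shared_inputs))     # → (1, 0)
```

Both functions see the same five inputs, which is the "shared inputs" restriction on the slide; with the sixth input free instead, O6 alone covers any 6-input function.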

Distributed RAMs: Slices in CLBs of type SLICEM can be configured as synchronous RAMs: 256x1b single port, or 128x1b dual/single port. They can also be configured as ROM with up to 256b. They can be instantiated by using special VHDL components.

HW, SW, and FPGA. Traditional approaches to computation: HW & SW.
HW (ASICs): Fixed on a particular application. Efficient: performance, silicon area, power. Higher cost per application.
SW (microprocessors): Used in many applications. Less efficient: performance, silicon area, power. Lower cost per application.

HW, SW, and FPGA. Field Programmable Gate Arrays (FPGAs):
Spatial computing: similar to HW. Reprogrammable: similar to SW. Faster than SW and more flexible than HW. Harder to program than SW. Less efficient than HW: performance, power consumption & silicon area.

Temporal vs Spatial Computing (SW vs. HW): $y = Ax^2 + Bx + C$
Temporal computation: t1 = x; t2 = t1 * A; t2 = t2 + B; t2 = t2 * t1; y = t2 + C
Spatial computation: [Figure: dataflow graph with inputs x, A, B, C feeding three multipliers and two adders to produce Y.]
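The two styles can be contrasted in a short sketch. Python executes both functions sequentially, so `spatial` only mirrors the shape of the dataflow graph (three multipliers and two adders that could all exist at once in hardware); the function names are illustrative.

```python
def temporal(x, A, B, C):
    """Five sequential steps that reuse two temporaries, as on the slide."""
    t1 = x
    t2 = t1 * A
    t2 = t2 + B
    t2 = t2 * t1
    y = t2 + C
    return y


def spatial(x, A, B, C):
    """One expression per operator: three multiplies and two adds."""
    ax2 = (A * x) * x    # two multipliers in series
    bx = B * x           # third multiplier, independent of the first two
    return (ax2 + bx) + C


print(temporal(3, 2, 5, 7), spatial(3, 2, 5, 7))  # → 40 40
```

The temporal version needs fewer operators (two multiplies) but one step at a time; the spatial version spends more hardware so that independent operations such as `B * x` can proceed in parallel.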

Why Is SW Slower?
Generality: The instruction set may not provide the operations your program needs. Processors provide hardware that may not be useful in every program or in every cycle of a given program: multipliers, dividers.
Instruction memory: Program instructions and intermediate results are stored in memory. Accessing memory is very slow.
Bit-width mismatches: General-purpose processors have a fixed bit width, and all computations are performed on that many bits.

Why Not Just Use HW?
Dedicated -> not programmable.
Takes a long time and high cost to design and develop (a typical processor takes a handful of years to design, with design teams of a few hundred engineers).
High non-recurring engineering (NRE) cost -> very expensive!
Justification for the high cost: high-volume applications, or cases where high performance is more desired.

ASIC vs FPGA

Reading: Paper at http://www.cse.usf.edu/~zheng/teaching/cda4253/ "FPGA Architectures: An Overview". Sections 2.1, 2.2, 2.3, 2.4 (skip 2.4.1.1, 2.4.2.2, 2.4.2.3). Skim 2.6.