EN2911X: Reconfigurable Computing Topic 01: Programmable Logic. Prof. Sherief Reda School of Engineering, Brown University Fall 2014



Contents
1. Architecture of modern FPGAs: programmable interconnect, programmable logic blocks
2. How to design FPGAs?
3. Case studies

1. FPGA architecture
Components: programmable interconnect, programmable logic blocks, programmable logic elements [Maxfield 04]
Objective: study the organization of programmable logic blocks and interconnects

Basic logic element (BLE) [Rose 04] [Maxfield 04]
How many bits does a K-input lookup table (LUT) store?
How many Boolean functions can a K-input LUT implement?
What is the best LUT size?
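As a rough illustration of the first two questions: a K-input LUT stores one configuration bit per input combination, so it holds 2^K bits, and every distinct setting of those bits is a different Boolean function, which gives 2^(2^K) functions. The helper names below are illustrative and do not come from the slides.

```python
def lut_bits(k: int) -> int:
    """Configuration bits in a K-input LUT: one bit per input combination."""
    return 2 ** k


def lut_functions(k: int) -> int:
    """Distinct Boolean functions of K inputs: one per setting of the LUT bits."""
    return 2 ** lut_bits(k)


if __name__ == "__main__":
    for k in range(2, 7):
        print(f"K={k}: {lut_bits(k)} bits, {lut_functions(k)} functions")
```

The storage cost doubles with every added input, which is one reason the LUT size cannot simply be made as large as possible.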

A closer look at the BLE

Larger circuits need to be decomposed into a number of BLEs [Figure from Cong FPGA 01]

Example [from J. Zambreno]
$F = A_0 A_1 A_3 + A_1 A_2 \bar{A}_3 + \bar{A}_0 \bar{A}_1 \bar{A}_2$
Figure panels: mapping of F onto 4-input, 3-input, and 2-input LUTs
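A minimal sketch of the 4-input case: F depends on four variables, so it fits in a single 4-input LUT whose 16 configuration bits are simply the truth table of F. The bit ordering (input j selects bit j of the table index) and the function names are assumptions made for this illustration.

```python
def f(a0, a1, a2, a3):
    """F = A0 A1 A3 + A1 A2 ~A3 + ~A0 ~A1 ~A2, written as a Boolean expression."""
    return (a0 and a1 and a3) or (a1 and a2 and not a3) or (
        not a0 and not a1 and not a2
    )


def build_lut(func, k):
    """Fill a K-input LUT: entry i is the output when input j equals bit j of i."""
    table = []
    for i in range(2 ** k):
        bits = [(i >> j) & 1 for j in range(k)]
        table.append(int(bool(func(*bits))))
    return table


def lut_read(table, *inputs):
    """Evaluate the LUT: the inputs form the address of the stored bit."""
    index = sum(bit << j for j, bit in enumerate(inputs))
    return table[index]


LUT4 = build_lut(f, 4)

if __name__ == "__main__":
    print("LUT contents:", LUT4)
    print("F(1, 1, 0, 1) =", lut_read(LUT4, 1, 1, 0, 1))
```

With smaller LUTs the same function has to be split into several tables whose outputs feed one another, which is the decomposition the figure panels illustrate.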

Logic block clusters (logic array block, LAB; configurable logic block, CLB)
Assume a K-input LUT in each BLE and N BLEs per logic cluster.
The BLEs in each logic cluster are fully or nearly fully connected.
Why is I, the number of cluster inputs, less than K·N? [Betz-Rose 97]

Heterogeneous reconfigurable logic
A reconfigurable fabric might contain non-reconfigurable elements that interface to the logic blocks through the programmable interconnect fabric.
Examples:
Embedded memory
Embedded multipliers, adders, MAC
Embedded processors

Embedded memory blocks
Implementing memory with configurable logic blocks is costly, so FPGAs add hard blocks of RAM.
Position and size vary depending on the FPGA device; each RAM block holds from a few thousand to tens of thousands of bits [Maxfield 04].
Each block can be used independently or combined with others to form larger RAM blocks.
Blocks can be single-port or dual-port RAMs.

Embedded multipliers and adders
Multipliers are inherently slow if implemented by connecting a large number of programmable logic blocks, so FPGAs add hard-wired multiplier blocks.
These are typically located close to the embedded RAM blocks.
Some FPGAs use multiply-and-accumulate (MAC) blocks, which are useful in DSP applications.

Programmable routing
Wires provide the communication fabric that routes the output of one computational node to the inputs of another.
Why is routing so crucial?
Routing resources occupy a larger area than logic resources in an FPGA.
The delay of a wire grows quadratically with its length.
Technology scaling reduces device delay but increases wire delay.
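The quadratic growth can be seen from a first-order model that is an assumption here, not something stated in the slides: the Elmore delay of an unbuffered, uniformly distributed RC wire is about 0.5·r·c·L², because both the total resistance and the total capacitance grow linearly with the length L. The per-millimetre values below are arbitrary illustrative numbers, not data for any real process.

```python
def wire_delay(length_mm, r_per_mm, c_per_mm):
    """Elmore delay (seconds) of an unbuffered distributed RC wire: 0.5 * r * c * L^2."""
    return 0.5 * r_per_mm * c_per_mm * length_mm ** 2


# Illustrative (made-up) wire parameters: ohms per mm and farads per mm.
R_PER_MM = 100.0
C_PER_MM = 0.2e-12

if __name__ == "__main__":
    for length in (1, 2, 4, 8):
        delay_ps = wire_delay(length, R_PER_MM, C_PER_MM) * 1e12
        print(f"{length} mm wire: about {delay_ps:.1f} ps")
```

Doubling the length quadruples the delay in this model, which is why long connections are usually broken into segments or buffered.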

General routing definitions
[Figure: a row of CLBs with a track, a channel, and a segment labeled]
A wire segment is a wire unbroken by programmable switches.
A track is a sequence of one or more wire segments in a line. The segments can be connected by switches at their ends.
A routing channel is a group of parallel tracks. The channel width is the number of tracks in the channel.

Connection blocks: formed where CLB input or output pins connect to the routing channels.
Life would be easy if only logic blocks within the same column or row needed to communicate!

Segment-segment switch design for bidirectional wires [Lemieux 04]
[Figure: a row of CLBs with a track, a channel, and a segment labeled]

Switch blocks: formed wherever horizontal and vertical channels intersect.
The size of a switch box grows quadratically with the number of its input wires.

Bidirectional switch details [Lemieux 04, Tessier]

Segmented and hierarchical routing
Segmented routing: short wires accommodate local traffic. Short wires can be connected together using switch boxes to emulate longer wires. The fabric also contains long wires that allow efficient communication without passing through switches.
Hierarchical routing: routing within a group of logic blocks occurs at the local level. Longer hierarchical wires connect different groups.

2. How to design an FPGA?
Key design parameters:
K → design of the BLE
N → number of BLEs per LAB
I → external input connections to the LAB
W → number of wire tracks in a channel
Goals: design area, performance, area-delay product, power

Methodology [from Ahmed & Rose 04]
Need an architectural design flow that uses CAD tools together with benchmark circuits.

Determine W (number of tracks per channel)
To determine W for a given K, N, and I: repeatedly route each circuit, removing tracks from the architecture until routing fails. This gives the minimum track count.
Then add 30% more tracks to the minimum track count, perform a final low-stress routing, and use that routing to measure the critical path delay.
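The procedure can be sketched as a simple loop. The router is represented by a caller-supplied predicate `routes_ok(width)`, which is a hypothetical stand-in for a real place-and-route run; rounding the 30% margin up to a whole track is also an assumption of this sketch.

```python
def minimum_channel_width(routes_ok, start_width):
    """Remove tracks one at a time until routing fails.

    routes_ok(width) is a hypothetical callback that returns True when the
    circuit routes successfully with `width` tracks per channel.
    Returns the smallest width that still routed.
    """
    if not routes_ok(start_width):
        raise ValueError("circuit does not route at the starting width")
    width = start_width
    while width > 1 and routes_ok(width - 1):
        width -= 1
    return width


def low_stress_width(min_width, margin_percent=30):
    """Add the track margin to the minimum width, rounding up to a whole track."""
    return (min_width * (100 + margin_percent) + 99) // 100


if __name__ == "__main__":
    # Toy stand-in for a router: pretend the circuit needs at least 23 tracks.
    toy_router = lambda width: width >= 23
    w_min = minimum_channel_width(toy_router, start_width=40)
    print("minimum W:", w_min, "low-stress W:", low_stress_width(w_min))
```

Measuring delay at the relaxed width rather than at the bare minimum avoids reporting critical paths that are distorted by a router struggling for tracks.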

How to determine I given K and N? [from Ahmed & Rose 04]
Experiments aimed for 98% LUT utilization.
I should be a function of K (the LUT size) and N (the number of LUTs in a cluster).
A larger I implies larger and slower MUXes.
I is less than K·N; why?
Empirically, $I = K(N+1)/2$.
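A small sketch comparing the empirical rule I = K(N+1)/2 with the worst case of K·N fully independent inputs. Rounding up when K(N+1) is odd is an assumption of this sketch; the slide gives only the formula.

```python
def cluster_inputs(k, n):
    """Empirical number of cluster inputs, I = K(N+1)/2, rounded up to a whole pin."""
    return (k * (n + 1) + 1) // 2


def worst_case_inputs(k, n):
    """Inputs needed if every LUT pin in the cluster had its own external input."""
    return k * n


if __name__ == "__main__":
    for k, n in [(4, 4), (4, 10), (6, 8)]:
        print(f"K={k}, N={n}: I={cluster_inputs(k, n)} "
              f"versus K*N={worst_case_inputs(k, n)}")
```

The gap between the two numbers reflects that LUTs in a cluster share input signals and also consume each other's outputs through local feedback, so not every pin needs a separate external input.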

FPGA area as a function of K and N [from Ahmed & Rose 04]
LUT sizes of 4 and 5 are the most area efficient for all cluster sizes.
There is a reduction in total area as the cluster size is increased from 1 to 3 for all LUT sizes. However, as clusters are made larger, there is very little impact on total FPGA area.
Why? Increasing K and/or N increases intra-cluster area but reduces inter-cluster area.

Impact of LUT size (K) and LUTs per cluster (N) on intra-cluster area [from Ahmed & Rose 04]
Two reasons intra-cluster area increases with K:
1. The logic block area grows exponentially with LUT size, as there are $2^K$ bits in a K-input LUT.
2. Larger LUT sizes require larger intra-cluster multiplexers, because the size of each multiplexer is $I + N = K(N+1)/2 + N$.
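The two growth terms can be tabulated directly from the formulas above. This is only a count of LUT bits and multiplexer inputs, not an area model; converting these counts into silicon area would need transistor-level data that the slides do not give.

```python
def intra_cluster_costs(k, n):
    """Return (LUT configuration bits, inputs per intra-cluster multiplexer).

    LUT bits: 2^K.  Multiplexer inputs: I + N with I = K(N+1)/2.
    """
    lut_bits = 2 ** k
    mux_inputs = k * (n + 1) / 2 + n
    return lut_bits, mux_inputs


if __name__ == "__main__":
    n = 4  # illustrative cluster size
    for k in range(2, 8):
        bits, mux = intra_cluster_costs(k, n)
        print(f"K={k}, N={n}: {bits} LUT bits, {mux:g} inputs per mux")
```

The multiplexer term grows linearly in K while the LUT term doubles with each added input, so the exponential term dominates for large K.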

Impact of LUT size (K) and LUTs per cluster (N) on inter-cluster area [from Ahmed & Rose 04]
The inter-cluster routing area decreases in a linear fashion with increasing LUT size.

Final impact on area [from Ahmed & Rose 04]
Intra-cluster area + inter-cluster area = total area
Early increases in K (2 → 5) lead to modest increases in intra-cluster area with steady reductions in inter-cluster area.
Subsequent increases in K lead to large intra-cluster area increases that offset the inter-cluster area reductions.

Impact of K and N on performance [from Ahmed & Rose 04]
As the LUT and cluster sizes increase:
The delay of the BLE and the delay through a single cluster increase.
The number of BLEs and clusters on the critical path decreases.
K = 6 seems the best for performance.

Adaptive LUTs
Motivation: a fixed K targets the average behavior of circuits → some circuits benefit from a higher K and others from a lower K.
A higher K (e.g., K = 6) improves performance, but a lower K (e.g., K = 4) is better for area.
Can we have adaptive LUTs, i.e., LUTs that are sometimes used as 6-input LUTs and sometimes as multiple 4-input LUTs?
Composable LUTs
Fracturable LUTs

Composable LUTs
A composable 6-LUT is constructed from 4-LUTs.
All pins are independent → how many more input pins are required than for a standard 6-input LUT? What is the cost of a pin?

Fracturable LUTs
Example of a (6, 2) fracturable LUT.
Sharing input pins reduces area overhead but also reduces logic flexibility.
A (k=6, m=2) LUT can implement:
Any 6-input function
Two 5-input functions (which must share two inputs)
Any two 4-input functions
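The three cases listed above are consistent with a simple pin-count view, sketched below under the assumption that a (k, m) fracturable LUT has k + m input pins in total and that each of the two functions can use at most k - 1 inputs when the LUT is split. This is a simplified model for illustration, not a description of any vendor's exact rules.

```python
def fits_fracturable(k, m, size_a, size_b, shared):
    """Simplified pin-count check for splitting a (k, m) fracturable LUT in two.

    size_a, size_b: number of inputs of the two functions.
    shared: number of inputs the two functions have in common.
    Assumes k + m physical input pins and at most k - 1 inputs per half.
    """
    if size_a > k - 1 or size_b > k - 1:
        return False  # each half of the split LUT is at most a (k-1)-input LUT
    if shared > min(size_a, size_b):
        return False  # cannot share more inputs than a function has
    distinct_pins = size_a + size_b - shared
    return distinct_pins <= k + m


if __name__ == "__main__":
    print("two 5-input, 2 shared:", fits_fracturable(6, 2, 5, 5, shared=2))
    print("two 5-input, 1 shared:", fits_fracturable(6, 2, 5, 5, shared=1))
    print("two 4-input, 0 shared:", fits_fracturable(6, 2, 4, 4, shared=0))
```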

Programming the FPGA
[Figure: configuration data in, configuration data out; legend: I/O pin/pad, SRAM cell]
Configuration memory determines the programmability of the logic blocks and interconnects.

Programmable switch technology
Anti-fuse: the switch is OFF by default; when programmed it is ON. Advantages: negligible delay, small area overhead. Disadvantages: not really reconfigurable (one-time programmable).
Flash: the switch is ON by default; when programmed it is OFF. Advantages: programming is not lost when the device is turned off. Disadvantages: requires more manufacturing steps.
SRAM: an SRAM bit cell stores the programmability of the device. Advantages: can be reconfigured quickly and as often as required; no special fabrication steps. Disadvantages: takes more area; loses its contents when powered off.

3. Case study: Altera's Cyclone II device
Two-dimensional array of Logic Array Blocks (LABs), with 16 Logic Elements (LEs) in each LAB.
Embedded memory blocks (M4K) and multipliers (18x18).
PLLs (phase-locked loops) generate clock signals over a range of frequencies.
The EP2C35 (on the DE2 board) has 60 columns and 45 rows, for a total of 33216 LEs, 105 M4K blocks, and 35 embedded multipliers.

Logic element organization (normal mode)
The LE has two operating modes: normal and arithmetic.
Normal mode is suitable for general logic implementation.
4-input LUT
6 input connections
3 output connections
LAB-wide synchronous/asynchronous clear and load signals
Clock signal

Logic element organization (arithmetic mode)
Arithmetic mode is suitable for implementing adders, counters, accumulators, and comparators.
The LUT is split into two 3-input LUTs (ideal for implementing 2-bit full adders) and a basic carry chain.
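A functional sketch of why two 3-input LUTs plus a carry chain suit addition: one LUT can hold the sum function and the other the carry-out function of the same three inputs (a, b, carry-in), and the carry output feeds the next element. This models behavior only; it is not the exact Cyclone II LE wiring, and the names are made up for the illustration.

```python
def make_lut3(func):
    """Build an 8-entry table; index bit 0 = a, bit 1 = b, bit 2 = carry-in."""
    return [func(i & 1, (i >> 1) & 1, (i >> 2) & 1) for i in range(8)]


# One 3-input LUT for the sum bit, one for the carry-out bit.
SUM_LUT = make_lut3(lambda a, b, cin: a ^ b ^ cin)
CARRY_LUT = make_lut3(lambda a, b, cin: (a & b) | (a & cin) | (b & cin))


def ripple_add(x, y, width):
    """Add two unsigned numbers bit by bit, chaining the carry between stages."""
    carry = 0
    result = 0
    for bit in range(width):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        index = a | (b << 1) | (carry << 2)
        result |= SUM_LUT[index] << bit
        carry = CARRY_LUT[index]
    return result | (carry << width)  # final carry becomes the top bit


if __name__ == "__main__":
    print("5 + 3 =", ripple_add(5, 3, width=4))
```

A dedicated carry chain matters because the carry signal is on the critical path of every stage, so routing it through the general interconnect would slow wide adders considerably.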

Logic array block organization
Each LAB consists of the following: 16 LEs, LAB control signals, LE carry chains, register chains, and local interconnects.
Local interconnects transfer signals between LEs in the same LAB and are driven by column and row interconnects and by LE outputs within the same LAB.
Neighboring LABs, PLLs, M4K RAM blocks, and multipliers from the left and right can also drive a LAB's local interconnect.
Each LE can drive 48 LEs through fast local and direct link interconnects.

Register/carry chain connections within a LAB

Multi-track interconnects
The multitrack interconnect consists of row interconnects (direct link, R4, R24) and column interconnects (register chain, C4, C16).
R4/C4 interconnects span 4 blocks (right or left / up or down).
R24/C16 interconnects span 24/16 blocks and connect to R4/C4 interconnects.
R4 and C4 interconnects can drive each other to extend their range.

C4 and C16 interconnects
C4 interconnects drive local and R4 interconnects for up to 4 rows.
C16 column interconnects span 16 LABs and provide long column connections.
C16 column interconnects indirectly drive LAB local interconnects via C4 and R4 interconnects.

Embedded RAMs and multipliers
M4K RAM block: 4608 RAM bits (with or without parity); 250 MHz performance; either single-port or dual-port memory; can also be configured as a FIFO.
Embedded multiplier: ideal for DSP applications; 250 MHz performance; configured either as one 18-bit multiplier or as two independent 9-bit multipliers.

IO Element (IOE) structure (allows bidirectional signals)
5 IOEs per row I/O block.
Row I/O blocks drive C4, R4, R24, or direct link interconnects.
Column I/O blocks drive C4 and C16 interconnects.

Summary
Architectural makeup of FPGAs: LUTs, routing MUXes, switches, multipliers, memory, etc.
Design and area overhead of programmable LUTs and interconnects
FPGA design methodology
Impact of FPGA design parameters on area and delay → choosing optimal parameters
Case study on the Altera Cyclone II FPGA