Exploring Architecture Parameters for Dual-Output LUT based FPGAs

Zhenghong Jiang, Colin Yu Lin, Liqun Yang, Fei Wang and Haigang Yang
System on Programmable Chip Research Department, Institute of Electronics, Chinese Academy of Sciences, Beijing, China
The University of Chinese Academy of Sciences, Beijing, China
Corresponding author: yanghg@mail.ie.ac.cn

Contents
1 Motivation
2 Preliminaries
3 Experimental Results
4 Conclusion

MOTIVATION
- Architecture parameter exploration has always been a key part of FPGA research: look-up table (LUT) size, cluster size, number of cluster inputs, etc.
- Dual-output LUTs have replaced traditional single-output LUTs as the mainstream solution in commercial FPGA chips: Stratix II, III, ...; Virtex 5, 6, ...
- However, no published research explores the design parameters of dual-output LUT based architectures.
- Although not innovative in itself, a careful exploration of design parameters for the new FPGA architecture is still necessary and helpful.


Parameters To Explore
Traditional parameters:
- Look-up table size
- Number of inputs to a logic cluster
- Size of the logic cluster
New parameters for the dual-output structure:
- Ratio of shared inputs between the two sub-LUTs
- Number of inputs of a dual-output LUT
- Largest unfractured LUT size

Architecture Parameter Description
- Use the VTR benchmarks as the evaluation circuits.
- Use Berkeley's ABC to complete the technology mapping.
- Use VPR to do the physical synthesis.
- Several iterations are run to obtain the final delay and area results.
CAD flow: Benchmark circuits -> Technology mapping -> (1) Merge small LUTs, (2) Pack LUTs and registers -> Placement -> Routing -> Minimum channel width? -> Re-routing with 1.3x the minimum channel width -> Area and delay results
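The re-routing step of the flow can be sketched as follows. This is a minimal illustration, not the authors' script: the function name is hypothetical, and rounding the result up to an even track count is an assumption (commonly needed when wires are unidirectional), not something the slides state.

```python
def relaxed_channel_width(w_min: int, num: int = 13, den: int = 10) -> int:
    """Return the channel width used for the final re-routing pass.

    The slides re-route at 1.3x the minimum channel width; the factor is
    held as the fraction num/den so that the ceiling is exact.
    """
    if w_min <= 0:
        raise ValueError("w_min must be a positive track count")
    # Integer ceiling of w_min * num / den, avoiding floating-point error.
    width = -(-w_min * num // den)
    # Assumption (not from the slides): keep the width even so that
    # unidirectional tracks can be paired in both directions.
    return width + (width % 2)
```

For example, a minimum channel width of 40 tracks gives 52 tracks for the final routing run, and 50 tracks gives 65, rounded up to 66.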

Area and Delay Models 1: Model for LUTs
- Single-output structure: full-custom design based on a commercial 40 nm technology.
  T_single: delay of a single LUT
  A_single: area of a single LUT
- Dual-output structure: since the value of R varies, it is impossible to implement every variant by hand, so a model derived from the single-output LUT is used.
  [Model equations garbled in extraction. Surviving symbols: T_dual, T_single, A_dual, A_single, A_mux, Number_mux.]

Area and Delay Models 2: Model for Crossbar
- Besides the LUTs, the input crossbar in a logic cluster also accounts for a large share of area and delay.
- A semi-analytical method is used: several values of the input count are implemented, and the resulting data are used to fit equations that estimate delay and area.
  [Fitted equations garbled in extraction. Surviving delay terms: T_X, coefficient 0.28 on I, exponent 0.88, constant 109.9. Surviving area terms: A_X, N, J, I, A_SRAM, A_buf.]

Area and Delay Models 3: Flip-Flop and Output Multiplexer
- There is no difference between the single-output and dual-output LUT based FPGA architectures.
- Full-custom design is used to obtain delay and area.
Routing Architecture
- Classical island-style routing architecture with unidirectional, single-driver wire segments.
- Parameters are extracted from our own implementation.


Experimental Groups by R
- The two figures show the two extreme values of R.
- The feasible range of R differs with LUT size. When the LUT size is large, it becomes hard to evaluate every possible value of R.
- Thus, we explore four representative values of R (0, 1/3, 2/3 and 1), which capture the essential impact of the number of shared inputs.
[Figures: dual-output LUT at R = 0 and at R = 1]

Number of Inputs to a Logic Cluster
- The number of inputs to a logic cluster, I, is a very important parameter for FPGA exploration, since it directly determines the logic utilization, area and delay of the logic cluster.
- Using the lowest value of I that provides 98% of the maximum cluster utilization is appropriate*.
- For single-output LUTs, I = K(N+1)/2.
- For dual-output LUTs, we study the best I in relation to K and N under different values of R.
- From the figures, we can see that I still shows a linear relationship with K and N; only the specific values differ with R.
[Figures: best I versus K and N, one panel each for R = 0, R = 1/3, R = 2/3 and R = 1]
*E. Ahmed and J. Rose, "The effect of LUT and cluster size on deep-submicron FPGA performance and density," IEEE Trans. on VLSI Systems, 12(3), 288-298, 2004.
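As a worked instance of the single-output rule I = K(N+1)/2, the short sketch below evaluates it for a few cluster configurations. The function name is hypothetical, and rounding up when K(N+1) is odd is an assumption, since the slides give only the closed-form expression.

```python
def single_output_cluster_inputs(k: int, n: int) -> int:
    """Cluster input count I = K(N+1)/2 for a single-output LUT architecture.

    k: LUT size (inputs per LUT), n: number of LUTs per cluster.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    # Assumption: round up when K(N+1) is odd, so I is a whole pin count.
    return -(-k * (n + 1) // 2)


# A few representative (K, N) points.
for k, n in [(4, 10), (6, 8), (5, 10)]:
    print(f"K={k}, N={n} -> I={single_output_cluster_inputs(k, n)}")
```

With K = 4 and N = 10 this gives I = 22, compared with the 40 pins that would be needed if every LUT input were brought out separately.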

Estimation of Area
- Chip area is always an important metric in FPGA architecture design.
- In our experiments, area is measured as the total number of minimum-width transistors required to implement both the logic and routing resources.

Area versus R
The figure illustrates the best ratio of shared inputs at each combination of K and N:
- No single ratio of shared inputs gives the best area result for all architectures.
- The best value of R for minimum area increases as K and N grow.
- Except for some small values of K and N, R = 2/3 is the best choice, covering 78% of all cases.

Area Under Best R
- The figure shows the area for different K and N when the best value of R is chosen at each point.
- A small LUT size combined with a large cluster size is the preferred combination for area efficiency.

Area Breakdown
To better understand the difference between dual-output and single-output LUTs, we break the total area down into logic and routing parts. Several observations can be made:
- Routing area becomes less important as LUT size grows. When the LUT size reaches 6, logic area becomes the dominant part of the total area.
- The dual-output architecture gains most of its benefit from the logic part, with only a little improvement in routing area.
- The dual-output architecture gives better area efficiency at large LUT sizes.
[Figure legend: dual-output logic, black solid; dual-output routing, black dashed; single-output logic, red solid; single-output routing, red dashed]

Delay versus R
The figure shows the distribution of the best choice of R for different K and N:
- Unlike area, there is no clear pattern for the best value of R, which indicates that performance correlates only weakly with the ratio of shared inputs.
- Although the statistics give a distribution of the best R, the delay deviation across the four values of R is always less than 10%, which indicates that R has little impact on delay.
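The "less than 10%" observation can be checked per (K, N) point with a calculation like the one below. This is only a sketch: the slides do not define "deviation", so the relative spread (max minus min, divided by min) is an assumed definition, and the delay values are made up for illustration.

```python
def relative_spread(delays_by_r: dict) -> float:
    """Relative spread of delay across the R values at one (K, N) point.

    Assumed definition (not given in the slides): (max - min) / min.
    """
    values = list(delays_by_r.values())
    if not values or min(values) <= 0:
        raise ValueError("need at least one positive delay value")
    return (max(values) - min(values)) / min(values)


# Made-up delays in ns for the four representative R values.
example_delays = {"0": 10.0, "1/3": 10.2, "2/3": 10.5, "1": 10.8}
spread = relative_spread(example_delays)
print(f"spread = {spread:.1%}, below 10%: {spread < 0.10}")
```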

Delay Under Best R
- As for area, the figure illustrates the trend of delay over different combinations of K and N when the best R is chosen at each point.
- Performance favors large LUTs and large clusters, which is a different relationship from the one observed for area.

Delay Comparison
[Figure legend: dual-output, black; single-output, red]
Observations:
- The difference between the dual-output and the traditional single-output architecture is not pronounced.
- Statistically, the deviation in total delay between the two architectures is less than 6.5%.
- Therefore, unlike area efficiency, performance is not noticeably improved by the dual-output architecture.

Area-Delay-Product versus R
In the previous discussion, area and delay show two opposite tendencies:
- Area efficiency prefers a small LUT size.
- Performance requires a LUT size as large as possible.
The area-delay product is a way to make a trade-off between area and performance.
The figure illustrates the best R at different combinations of K and N:
- The difference in delay between the four values of R is small; as a result, area is the dominant factor in the area-delay product.
- The distribution of the best R for area-delay product is similar to the distribution for area.
- However, the delay metric still plays a role in the choice of R, so the share of R = 1/3 increases due to its delay benefits.
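The selection of the best R by area-delay product can be sketched as below. The function names and the data layout are hypothetical, and the area and delay numbers are invented for illustration; they are not results from the slides.

```python
from collections import Counter


def best_r_by_adp(results: dict) -> dict:
    """Map each (K, N) point to the R with the smallest area * delay.

    results: {(K, N): {R_label: (area, delay)}}
    """
    best = {}
    for kn, per_r in results.items():
        best[kn] = min(per_r, key=lambda r: per_r[r][0] * per_r[r][1])
    return best


def share_of_each_r(best: dict) -> dict:
    """Fraction of (K, N) points at which each R value is the winner."""
    counts = Counter(best.values())
    return {r: counts[r] / len(best) for r in counts}


# Made-up (area, delay) pairs purely for illustration.
example = {
    (4, 8): {"0": (100.0, 10.0), "1/3": (92.0, 10.2),
             "2/3": (90.0, 10.8), "1": (95.0, 11.0)},
    (6, 10): {"0": (140.0, 8.0), "1/3": (128.0, 8.1),
              "2/3": (120.0, 8.3), "1": (125.0, 8.6)},
}
winners = best_r_by_adp(example)
shares = share_of_each_r(winners)
```

In the first made-up point, R = 2/3 has the smallest area but R = 1/3 wins on area-delay product because of its lower delay, which mirrors the shift toward R = 1/3 described above.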

Area-Delay-Product Comparison
[Figure legend: dual-output, black; single-output, red]
- The advantage of the dual-output architecture becomes more pronounced as LUT size grows.
- The best area-delay product occurs at a LUT size of 5 to 6, which is larger than the traditional result of 4.


Conclusion
Summary of the best parameters for different design goals:

Criteria      Single-Output        Dual-Output
              K    N               K         N          R
Area          4    7 to 15         4 to 5    7 to 11    1/3 to 2/3
Delay         9    7 to 15         9         6 to 15    0 to 1/3
Area-Delay    4    11 to 15        5         8 to 12    1/3 to 2/3

- No single ratio of shared inputs is advantageous for every LUT and cluster size.
- Dual-output LUTs are an efficient way to reduce area cost at large LUT sizes, but not a good way to improve performance.

Suggestions and discussions are welcome by email: yanghg@mail.ie.ac.cn Thank you!