Design of Address Generators Using Multiple LUT Cascade on FPGA

Hui Qin and Tsutomu Sasao
Department of Computer Science and Electronics, Kyushu Institute of Technology
680-4, Kawazu, Iizuka, Fukuoka, 820-8502, Japan

Abstract

This paper presents multiple LUT cascades to realize an address generator that produces unique addresses ranging from 1 to k for k distinct input vectors. We implemented six kinds of address generators using multiple LUT cascades, Xilinx CAM (a Xilinx IP core), and an address generator using registers and gates on Xilinx Spartan-3 FPGAs. One of our implementations has 76% more throughput, 29.5 times more throughput/slice, and 1.35 times more throughput/memory than Xilinx CAM.

I. INTRODUCTION

An ordinary memory produces an output data value given an input address value. In contrast, an address generator produces an output address value corresponding to the applied input data value. If the input data value is not stored, a special output address is produced (e.g., 0). It is assumed that any data value is stored at exactly one address. Therefore, an input data value produces a unique output address value. The address generator has a broad range of applications, including data compression, routing tables in the internet [1], network switches, and dictionary searching.

Although an address generator can be implemented by a content addressable memory (CAM) [2], the CAM dissipates more power than a conventional RAM [3]. Xilinx [4] provides a design for the CAM implemented with block RAMs (BRAMs) of the Xilinx FPGA. Note that Xilinx CAM is an IP (intellectual property) core. Another realization of the address generator uses registers and logic gates. In this realization, the interconnections tend to be very complicated. Recently, Sasao has shown that a multiple-valued input address generator can be realized by an LUT (look-up table) cascade that uses conventional RAMs and gates [7].

In this paper, we propose an extension to an LUT cascade realization for a two-valued input address generator: a multiple LUT cascade realization that is easily reconfigured when additional binary vectors are required. In the multiple LUT cascade architecture (Fig. 3), the inputs of each LUT cascade are common with other LUT cascades, and the outputs of each LUT cascade are connected to an encoder. The LUT cascades are used to realize address generation functions and the encoder is used to generate the index from the outputs of the cascades.

Since both Xilinx CAM and the multiple LUT cascade use BRAMs, it is interesting to compare the multiple LUT cascade with Xilinx CAM on the same FPGA. Our basis for comparison is address generation functions with 48 input variables and 60 to 63 registered vectors implemented on a Xilinx Spartan-3 FPGA. First, by using the multiple LUT cascade, we designed six address generators: r3p12, r4p12, r4p12OR, r5p11, r5p11OR and r6p11. Then, we used the Xilinx Core Generator tool to produce a CAM. Finally, by using registers and logic gates, we implemented an address generator called Reg-Gate. Reg-Gate has the smallest delay, but requires many slices. In terms of the equivalent throughput, Reg-Gate is lower than the other implementations. The multiple LUT cascades produce higher throughput, higher throughput/slice, and higher throughput/memory than Xilinx CAM. r6p11 has 76% more throughput, 29.5 times more throughput/slice, and 1.35 times more throughput/memory than Xilinx CAM. In addition, if the area for one BRAM is less than or equal to the area for 64 slices, r6p11 is more efficient than Xilinx CAM in terms of the delay-area product, although it has 97% more delay than Xilinx CAM.

The rest of the paper is organized as follows: Section 2 describes the multiple LUT cascade. Section 3 shows other realizations for the address generator. Section 4 presents the implementation of the address generators using an FPGA. Section 5 shows the experimental results. Finally, Section 6 concludes the paper.

II. MULTIPLE LUT CASCADE

A. Address Generator

Definition 2.1 Let $\{\vec{a}_1, \vec{a}_2, \ldots, \vec{a}_k\}$ be a set of distinct binary vectors of n bits. An n-input address generation function is a mapping $F(\vec{x}): \{0,1\}^n \to \{0, 1, \ldots, k\}$, where

$$F(\vec{x}) = \begin{cases} i & \text{if } \vec{x} = \vec{a}_i, \\ 0 & \text{otherwise.} \end{cases}$$

k is the weight of the function (the number of non-zero output values). $\vec{a}_i$ is a registered vector.
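As a minimal behavioral sketch of Definition 2.1, the following Python models an address generation function as a dictionary lookup. The helper names and the three example vectors are hypothetical and chosen only for illustration; they do not come from the paper, and a dictionary says nothing about how the hardware realizes the function.

```python
from typing import Dict, Sequence, Tuple

Vector = Tuple[int, ...]  # an n-bit binary vector, e.g. (0, 1, 1)


def make_address_generator(registered: Sequence[Vector]) -> Dict[Vector, int]:
    """Assign address i (1-based) to the i-th registered vector a_i."""
    table: Dict[Vector, int] = {}
    for i, vec in enumerate(registered, start=1):
        if vec in table:
            raise ValueError(f"registered vectors must be distinct: {vec}")
        table[vec] = i
    return table


def address_of(table: Dict[Vector, int], x: Sequence[int]) -> int:
    """Evaluate F(x): the address of x, or 0 if x is not registered."""
    return table.get(tuple(x), 0)


# Hypothetical registered vectors for n = 3, k = 3 (not from the paper).
EXAMPLE_VECTORS = [(0, 0, 1), (1, 0, 1), (1, 1, 0)]
EXAMPLE_TABLE = make_address_generator(EXAMPLE_VECTORS)
```

With these three vectors the weight is k = 3, so five of the eight possible inputs map to 0.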
By this definition, F produces an address ranging from 1 to k for the registered vectors, and produces 0 for all other $2^n - k$ input vectors.

Typical applications of address generators include:

Data compression in communication - Used to produce an equivalent, but shorter message. With an address generator, the compressed data represents the same information using fewer bits.

Routing table in the internet - Used to generate the output of the destination address from the incoming IP (Internet Protocol) address. In IPv4, an IP address is represented by 32 bits, and the output address is represented by 8 to 16 bits.

Network switch - Used to process the address information from the incoming packet. In this application, the output address is usually 8 bits.

Dictionary searching - Used to generate the indices from the words in a dictionary. In an English word, each alphabetic character can be represented by 5 bits.

In the above applications, the common properties of the address generator include:

Exactly one element of the domain maps to a non-zero element of the range.
Typically, many elements of the domain map to 0.
The number of non-zero outputs is usually much smaller than the number of possible input combinations.
Data stored in the address generator can be updated.
A high-speed circuit is required.

B. Address Generator by an LUT Cascade

Functional decomposition [8] is a method where a given function is divided into functions with fewer inputs. For a given function $F(\vec{x})$, let $\vec{x}$ be partitioned as $(\vec{x}_A, \vec{x}_B)$. The decomposition chart of F is a table with $2^{n_A}$ columns and $2^{n_B}$ rows, where $n_A$ and $n_B$ are the numbers of variables in $\vec{x}_A$ and $\vec{x}_B$, respectively. Each column and row has its unique label with a binary number, and the corresponding element in the table denotes the value of F. The column multiplicity, $\mu$, is the number of different column patterns of the decomposition chart. By using functional decomposition, a function F can be decomposed as $F(\vec{x}_A, \vec{x}_B) = G(H(\vec{x}_A), \vec{x}_B)$, as shown in Fig. 1, where the number of rails (signal lines between the two blocks H and G) is $\lceil \log_2 \mu \rceil$, and $\lceil a \rceil$ denotes the smallest integer greater than or equal to a. By iterative functional decomposition, the given function can be realized by an LUT cascade shown in Fig. 2, where each cell consists of a memory [5], [9].

Fig. 1. Decomposition for the function F.

Fig. 2. LUT cascade.

Theorem 2.1 [7] An arbitrary n-input address generator with weight k can be realized by an LUT cascade, where each cell consists of a memory with p address lines and r outputs. Let s be the necessary number of levels or cells. Then, we have the relation:

$$s \le \left\lceil \frac{n - r}{p - r} \right\rceil, \qquad (1)$$

where $p > r$ and $r = \lceil \log_2 (k+1) \rceil$.
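The two quantities used in this subsection can be checked numerically. The sketch below counts the column multiplicity of a decomposition chart by brute force and evaluates the level bound, taking relation (1) to read $s \le \lceil (n-r)/(p-r) \rceil$ with $r = \lceil \log_2(k+1) \rceil$. The function names and the 4-input example function are hypothetical, not from the paper.

```python
from itertools import product


def column_multiplicity(f, n_a, n_b):
    """Number of distinct column patterns in the decomposition chart of f.

    f takes a tuple of n_a + n_b bits; the first n_a bits form x_A (the
    column label) and the remaining n_b bits form x_B (the row label).
    """
    columns = set()
    for x_a in product((0, 1), repeat=n_a):
        pattern = tuple(f(x_a + x_b) for x_b in product((0, 1), repeat=n_b))
        columns.add(pattern)
    return len(columns)


def rails_between_blocks(mu):
    """ceil(log2(mu)) signal lines between H and G, in integer arithmetic."""
    return (mu - 1).bit_length()


def rails_for_weight(k):
    """r = ceil(log2(k + 1)) for an address generation function of weight k."""
    return k.bit_length()


def max_levels(n, k, p):
    """Upper bound on the number of cells from relation (1)."""
    r = rails_for_weight(k)
    if p <= r:
        raise ValueError("relation (1) requires p > r")
    return -(-(n - r) // (p - r))  # ceiling division


# Hypothetical 4-input function with weight 3 (not from the paper).
_REGISTERED = {(0, 0, 0, 1): 1, (0, 1, 0, 1): 2, (1, 1, 1, 0): 3}


def example_f(x):
    return _REGISTERED.get(tuple(x), 0)
```

For the hypothetical function, splitting the inputs two and two gives four distinct column patterns, hence two rails, which matches $\lceil \log_2(3+1) \rceil$ for weight 3.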
C. Address Generator by Multiple LUT Cascade

A single LUT cascade realization often requires many levels. Since the delay is proportional to the number of levels in a cascade, we seek to reduce the number of levels. According to (1), if we increase p, then the number of levels s is reduced, but the amount of memory is increased. However, as shown in Fig. 3, we can use the multiple LUT cascade to reduce the levels when p is fixed.

For an n-input address generation function with weight k, let the number of rails of each LUT cascade be r. First, partition the set of vectors into $g = \lceil k/(2^r - 1) \rceil$ groups of $2^r - 1$ vectors each, except the last group, which has $2^r - 1$ or fewer vectors. For each group of vectors, form an independent address generation function with the same inputs. Then, for each group, realize the corresponding address generator with an LUT cascade. Finally, use a special encoder to produce the correct output of the address generator. Let $v_i$ (i = 1, 2, ..., g) be the i-th input of the encoder (i.e., $v_i$ is the output value of the i-th LUT cascade), and let $v_{out}$ be the output value of the encoder. Then,

$$v_{out} = \begin{cases} 0 & \text{if } v_i = 0 \text{ for all } i, \\ v_i + (i-1)(2^r - 1) & \text{if } v_i \ne 0. \end{cases}$$

Example 2.1 Consider an n-input address generation function with weight k, where k = 1000 and n = 32. By Theorem 2.1, we have r = 10. Let p = r + 1 = 11. When we use a single LUT cascade to realize the function, by Theorem 2.1, we need $\lceil (n-r)/(p-r) \rceil = 22$ cells, and the number of levels s of the LUT cascade is also 22. Since each cell consists of a memory with 11 address lines and 10 outputs, the total amount of memory is $2^{11} \times 10 \times 22 = 450$K bits. Note that such a cell does not fit in a single block RAM (BRAM) of the Xilinx FPGA, which contains 18K bits.

However, if we use a multiple LUT cascade to realize the function, we can reduce the number of levels and the total amount of memory, as well as the size of the cells so that they fit in the BRAMs of the Xilinx FPGA. Partition the set of vectors into two groups, and realize each group independently. Then, we need two LUT cascades. For each LUT cascade, the number of vectors is 500, so r = 9. Also, let p = r + 2 = 11. Then, we need $\lceil (n-r)/(p-r) \rceil = 12$ cells in each cascade. Note that the number of levels of the LUT cascade is 12, which is smaller than that of the single LUT cascade realization. Since each cell consists of a memory with 9 outputs and at most 11 address lines, the total amount of memory is at most $2^{11} \times 9 \times 12 \times 2 = 442$K bits. Also, note that the size of the memory for a cell is $2^{11} \times 9 = 18$K bits. This just fits the BRAM of the Xilinx FPGA. Thus, the multiple LUT cascade not only reduces the number of levels and the total amount of memory, but also adjusts the size of the cells to fit into the available memory in the FPGA.
(End of Example)
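The arithmetic of Example 2.1 and the encoder rule of this subsection can be reproduced with a short Python sketch. It assumes the level bound $\lceil (n-r)/(p-r) \rceil$ is met with equality and that every cell is a full $2^p \times r$ memory, which gives the worst-case totals quoted in the example. All function names are hypothetical.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def cascade_cost(n, k, p, groups=1):
    """Return (rails, levels, total_bits) when k vectors are split evenly
    over `groups` cascades and every cell is a full 2**p x r memory."""
    per_group = ceil_div(k, groups)
    r = per_group.bit_length()          # ceil(log2(per_group + 1))
    levels = ceil_div(n - r, p - r)
    return r, levels, (2 ** p) * r * levels * groups


def partition(vectors, r):
    """Split the registered vectors into groups of at most 2**r - 1."""
    size = 2 ** r - 1
    return [vectors[i:i + size] for i in range(0, len(vectors), size)]


def special_encoder(cascade_outputs, r):
    """v_out = v_i + (i - 1)(2**r - 1) for the non-zero v_i, else 0."""
    for i, v in enumerate(cascade_outputs, start=1):
        if v != 0:
            return v + (i - 1) * (2 ** r - 1)
    return 0
```

Here `cascade_cost(32, 1000, 11)` gives 22 levels and 450,560 bits for the single cascade, while `cascade_cost(32, 1000, 11, groups=2)` gives 12 levels and 442,368 bits for the pair of cascades. The encoder relies on at most one cascade producing a non-zero value, which holds because the groups are disjoint.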

Fig. 3. Multiple LUT cascade architecture.

Fig. 4. Detailed design of the LUT cascade.

Fig. 5. Multiple LUT cascade OR architecture.

The multiple LUT cascade architecture is shown in Fig. 3, and the realization with this architecture is called the multiple LUT cascade realization. It consists of a group of LUT cascades and a special encoder. The inputs of each LUT cascade are common with the other LUT cascades, while the outputs of each LUT cascade are connected to the encoder. The LUT cascades are used to realize address generation functions, while the encoder is used to generate the index from the outputs of the cascades.

For an n-input address generation function with weight k, the detailed design of the LUT cascade in the multiple LUT cascade is shown in Fig. 4, where $x_i$ (i = 1, 2, ..., s) denotes the primary inputs to the i-th cell; $d_i$ (i = 1, 2, ..., s) denotes the data inputs to the i-th cell and provides the data value to be written in the RAM of the i-th cell; r denotes the number of rails, with $r \le \lceil \log_2 (k+1) \rceil$; and $c_j$ (j = 2, 3, ..., s) denotes the additional inputs to the j-th cell and is used to select the RAM location along with $x_j$ for write access. Note that $c_j$ and $d_i$ are represented by r bits, and the RAMs except for the last one have p address lines, but the last RAM has at most p address lines. When the MUX select signal is high, $c_j$ is connected to the RAM to write the data into the RAM. When it is low, the outputs of the RAM are connected to the inputs of the succeeding RAM, and the circuit works as a cascade to realize the function.

When the number of groups of the LUT cascades is small, we can use the multiple LUT cascade OR architecture to simplify the encoder, as shown in Fig. 5. The realization with this architecture is the multiple LUT cascade OR realization. In these two architectures, only the last cells are different. Note that in Fig. 5, the last cells in the LUT cascades except for the first row have more outputs than those in Fig. 3. This is because the i-th LUT cascade produces the values ranging from $(i-1)(2^r - 1)$ to $i(2^r - 1)$, where i = 1, 2, ..., g, r is the number of rails, and g is the number of groups of the LUT cascades. In this case, we can use OR gates instead of the encoder. Note that the total amount of memory in the multiple LUT cascade OR architecture is larger than in the multiple LUT cascade architecture. However, in FPGA implementations, if the memory size of the last cell is smaller than the embedded memory size of the FPGA, the multiple LUT cascade OR architecture is a good choice since it is faster than the corresponding multiple LUT cascade realization.

Example 2.2 Table I shows a 6-input address generation function with 6 registered vectors (weight 6).

TABLE I. TRUTH TABLE FOR A 6-INPUT ADDRESS GENERATION FUNCTION

 Input                 Output
 x1 x2 x3 x4 x5 x6     out2 out1 out0
  0  0  0  0  0  1       0    0    1
  0  0  0  1  0  1       0    1    0
  0  1  0  1  0  1       0    1    1
  0  0  0  1  1  1       1    0    0
  0  0  1  0  1  1       1    0    1
  1  1  1  1  1  1       1    1    0
  Other values           0    0    0

Single Memory Realization: The number of address lines is 6 and the number of outputs is 3. Thus, the total amount of memory is $2^6 \times 3 = 192$ bits.

Single LUT Cascade Realization: Since the weight of the function is k = 6, by Theorem 2.1, the number of rails is $r = \lceil \log_2 (6+1) \rceil = 3$. Let the number of address lines for the memory in a cell be p = 4. By partitioning the inputs into three disjoint sets $\{x_1, x_2, x_3, x_4\}$, $\{x_5\}$, and $\{x_6\}$, we have the cascade in Fig. 6, where the additional inputs to the cells are ignored. The total amount of memory is $2^4 \times 3 \times 3 = 144$ bits, and the number of levels is s = 3. Note that it requires a smaller amount of memory than the single memory realization.
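To make the single LUT cascade realization concrete, the following Python sketch builds the contents of three cells for the function of Table I, using the input partition {x1, x2, x3, x4}, {x5}, {x6}, and checks the cascade against the truth table for all 64 inputs. The rail encoding (0 for "no match so far", non-zero values numbering the matching prefixes) is an illustrative assumption and is not necessarily the encoding the authors store in their cells; all names are hypothetical.

```python
from itertools import product

# Registered vectors (x1..x6) as read from Table I; addresses are 1..6 in order.
REGISTERED = [
    (0, 0, 0, 0, 0, 1),
    (0, 0, 0, 1, 0, 1),
    (0, 1, 0, 1, 0, 1),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 1, 0, 1, 1),
    (1, 1, 1, 1, 1, 1),
]
CHUNKS = [4, 1, 1]  # primary inputs per cell: {x1..x4}, {x5}, {x6}
START = 1           # rail value fed to the first cell (modeling convenience)


def build_cascade(registered, chunks):
    """Build one lookup table per cell, keyed by (rail_in, primary_inputs).

    Rail value 0 means "no registered vector matches so far". Non-zero rail
    values number the distinct registered prefixes; the last cell emits the
    address itself. This encoding is an illustrative assumption.
    """
    cells, codes, pos = [], {(): START}, 0
    for level, width in enumerate(chunks):
        pos += width
        is_last = level == len(chunks) - 1
        new_codes = {}
        for address, vec in enumerate(registered, start=1):
            prefix = vec[:pos]
            if prefix not in new_codes:
                new_codes[prefix] = address if is_last else len(new_codes) + 1
        table = {
            (codes[prefix[:pos - width]], prefix[pos - width:]): code
            for prefix, code in new_codes.items()
        }
        cells.append(table)
        codes = new_codes
    return cells


def run_cascade(cells, chunks, x):
    """Propagate the rail value through the cells; unlisted entries give 0."""
    rail, pos = START, 0
    for table, width in zip(cells, chunks):
        rail = table.get((rail, tuple(x[pos:pos + width])), 0)
        pos += width
    return rail


def agrees_with_table(cells, chunks, registered):
    """Compare the cascade with the truth table on every possible input."""
    reference = {vec: i for i, vec in enumerate(registered, start=1)}
    n = len(registered[0])
    return all(
        run_cascade(cells, chunks, x) == reference.get(x, 0)
        for x in product((0, 1), repeat=n)
    )


def rail_width(cells):
    """Bits needed to carry the largest rail code produced by any cell."""
    return max(max(table.values()) for table in cells).bit_length()


CELLS = build_cascade(REGISTERED, CHUNKS)
```

Every rail code fits in three bits, in line with r = 3, so each of the three cells can be stored as a $2^4 \times 3$ memory, 144 bits in total against 192 bits for the single memory.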

Fig. 6. Single LUT cascade realization.

Fig. 7. Multiple LUT cascade realization and multiple LUT cascade OR realization: (a) multiple LUT cascade realization; (b) multiple LUT cascade OR realization.

Multiple LUT Cascade Realization: Partition Table I into two parts. This yields two functions. Since the weight of each function is 3, the number of rails is $\lceil \log_2 (3+1) \rceil = 2$. Thus, the number of groups of LUT cascades is $\lceil 6/(2^2 - 1) \rceil = 2$. Let the number of address lines for the memory in a cell also be 4. By partitioning the inputs into two disjoint sets $\{x_1, x_2, x_3, x_4\}$ and $\{x_5, x_6\}$, we obtain the realization in Fig. 7 (a), where the additional inputs to the cells are ignored. The upper LUT cascade realizes the upper part of Table I, while the lower LUT cascade realizes the lower part of Table I. The contents of each cell are shown in Table II.

TABLE II. TRUTH TABLES FOR THE CELLS IN THE MULTIPLE LUT CASCADE REALIZATION

Upper LUT cascade (cells 1 and 2):

 x1 x2 x3 x4   r1 r2   x5 x6   z1 z2
  0  0  0  0    0  0    0  1    0  1
  0  0  0  1    0  1    0  1    1  0
  0  1  0  1    1  0    0  1    1  1
  Other values  1  1
                Other values    0  0

Lower LUT cascade (cells 3 and 4):

 x1 x2 x3 x4   r3 r4   x5 x6   z3 z4
  0  0  0  1    0  0    1  1    0  1
  0  0  1  0    0  1    1  1    1  0
  1  1  1  1    1  0    1  1    1  1
  Other values  1  1
                Other values    0  0

The encoder generates the index $(out_2, out_1, out_0)$ from the pairs of outputs, $(z_1, z_2)$ and $(z_3, z_4)$:

$$out_2 = z_3 \lor z_4, \quad out_1 = z_1 \lor z_3 z_4, \quad out_0 = z_2 \lor z_3 \bar{z}_4.$$

The total amount of memory is $2^4 \times 2 \times 4 = 128$ bits, and the number of levels is 2. Note that the multiple LUT cascade realization uses less memory and fewer levels than the single LUT cascade realization.

Multiple LUT Cascade OR Realization: The design method is similar to the multiple LUT cascade realization. Fig. 7 (b) shows the architecture, where the additional inputs to the cells are ignored. Table III shows the contents of cell 4 only, since the contents of the other cells are the same as in the multiple LUT cascade realization.

TABLE III. TRUTH TABLE FOR CELL 4 IN THE MULTIPLE LUT CASCADE OR REALIZATION

 r3 r4 x5 x6   z3 z4 z5
  0  0  1  1    1  0  0
  0  1  1  1    1  0  1
  1  0  1  1    1  1  0
  Other values  0  0  0

The output part is simple, i.e.,

$$out_2 = z_3, \quad out_1 = z_1 \lor z_4, \quad out_0 = z_2 \lor z_5.$$

However, the total amount of memory is $2^4 \times 2 \times 3 + 2^4 \times 3 = 144$ bits, which is larger than that of the multiple LUT cascade realization.
(End of Example)

III. OTHER REALIZATIONS

A. Xilinx CAM

The Xilinx Core Generator system [10] provides two special designs for the CAM. One design uses only slices and is called the SRL16 Implementation. The other design uses block RAMs and is called the Block SelectRAM Implementation.

Both designs for the CAM are parameterizable IP (intellectual property) cores. The SRL16 Implementation requires no BRAMs, but requires a longer clock cycle time than the Block SelectRAM Implementation [11]. The Block SelectRAM Implementation uses the block RAM as a dual-port RAM [6], while the multiple LUT cascade uses the block RAM just as a single-port RAM. It is interesting to compare the multiple LUT cascade with the Xilinx CAM implemented by the Block SelectRAM Implementation on the same FPGA.

We use the following parameters provided by the Xilinx Core Generator to customize the core for the CAM:

Block SelectRAM Implementation.
Depth - Number of words (vectors) stored in the CAM: k.
Data width - Width of the data word (vector) stored in the CAM: n.
Match Address Type - Three options: Binary Encoded, Single-match Unencoded, and Multi-match Unencoded. We used the Binary Encoded option.

B. Registers and Gates

Fig. 8. Realization of the address generator with registers and gates.

The realization in Fig. 8 directly implements the address generator by registers and gates. The registers store the registered vectors. The exclusive NOR (XNOR) gates and the AND gate form an equivalence circuit whose output is 1 iff the stored vector is identical to the input vector. For an n-input address generator with one registered vector, we need an n-bit register, n copies of XNOR gates, and an n-input AND gate. For an n-input address generator with k registered vectors, we need k copies of n-bit registers, nk copies of XNOR gates, and k copies of n-input AND gates. In addition, we need an encoder with k inputs and $\lceil \log_2 (k+1) \rceil$ outputs to generate the output address. This circuit can be considered as a special case of the multiple LUT cascade architecture, where r = 1, p = 2, and g = k.

IV. FPGA IMPLEMENTATIONS

We implemented the address generators with 48 inputs and 60 to 63 registered vectors on a Xilinx Spartan-3 FPGA (Xc3s4000-5) using the multiple LUT cascade, the Xilinx Core Generator, and registers and gates. The FPGA device Xc3s4000-5 has 96 BRAMs and 27648 slices. Each BRAM contains 18K bits and each slice consists of two 4-input look-up tables. For each implementation, we described the circuit in Verilog HDL, and then used the Xilinx ISE 7.1i design software to synthesize and place and route.

First, we used the multiple LUT cascade to realize the address generator. To use the BRAMs in the FPGA efficiently, the memory size of a cell in the LUT cascade should not exceed the BRAM size. Let p be the number of address lines of the memory in the cell. Since each BRAM contains $2^{11} \times 9$ bits, we have the following relation:

$$2^p \cdot r \le 2^{11} \times 9,$$

where r is the number of rails. Thus, we have the following:

$$p = \lfloor \log_2 (9/r) \rfloor + 11,$$

where $\lfloor a \rfloor$ denotes the largest integer less than or equal to a.

We designed six kinds of address generators, r3p12, r4p12, r4p12OR, r5p11, r5p11OR and r6p11, as shown in Table IV, where the column Vectors denotes the number of registered vectors, the column r denotes the number of rails, the column p denotes the number of inputs of the RAM in a cell, the column Groups denotes the number of LUT cascades, and the column Levels denotes the number of levels or cells in the LUT cascade. Among these six designs, r4p12OR and r5p11OR are multiple LUT cascade OR realizations and the others are multiple LUT cascade realizations.

TABLE IV. SIX REALIZATIONS USING MULTIPLE LUT CASCADE

 Design    Vectors   r    p   Groups  Levels
 r3p12       63      3   12     9       5
 r4p12       60      4   12     4       6
 r4p12OR     60      4   12     4       6
 r5p11       62      5   11     2       8
 r5p11OR     62      5   11     2       8
 r6p11       63      6   11     1       9

 r: Number of rails
 p: Number of inputs of the RAM in a cell
 Groups: Number of LUT cascades

Note that r3p12 contains 9 groups, and the RAM size of each cell is $2^{12} \times 3 = 12$K bits. We can estimate that if we use the multiple LUT cascade OR architecture for r3p12, we need 7 more BRAMs. Thus, r3p12 is unsuitable for the multiple LUT cascade OR architecture.

To illustrate Table IV, consider the design of r5p11 in Fig. 9, where the additional inputs to the cells are ignored.

Fig. 9. Architecture for r5p11.
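The parameters in Table IV follow from three relations: the BRAM capacity constraint $2^p \cdot r \le 2^{11} \times 9$, the group count $\lceil k/(2^r - 1) \rceil$, and the level bound $\lceil (n-r)/(p-r) \rceil$ taken with equality. The Python sketch below recomputes them for n = 48; the function names are hypothetical, and treating the level bound as an equality is an assumption that happens to match the table.

```python
BRAM_BITS = 2 ** 11 * 9  # capacity of one 18K-bit block RAM


def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def max_address_lines(r, bram_bits=BRAM_BITS):
    """Largest p such that a 2**p x r memory fits in one BRAM."""
    p = 0
    while 2 ** (p + 1) * r <= bram_bits:
        p += 1
    return p


def design_parameters(n, k, r):
    """Return (p, groups, levels, brams) for k vectors of n bits and r rails."""
    p = max_address_lines(r)
    groups = ceil_div(k, 2 ** r - 1)
    levels = ceil_div(n - r, p - r)
    return p, groups, levels, groups * levels
```

For example, `design_parameters(48, 62, 5)` returns `(11, 2, 8, 16)`: two cascades of eight cells each, one BRAM per cell.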

In r5p11, the number of rails is r = 5, so the number of groups is $\lceil 62/(2^5 - 1) \rceil = 2$. Thus, we need two LUT cascades and an encoder. Since each LUT cascade consists of 8 cells, the number of levels of r5p11 is 8. To use the BRAMs in the FPGA efficiently, the number of inputs of the RAM in a cell is $p = \lfloor \log_2 (9/5) \rfloor + 11 = 11$. Note that the number of primary inputs to the last cell is 1; this is because $48 - 11 - 6 \times 6 = 1$. Although the RAM in the last cell has 6 inputs, it still requires one BRAM. As for the special encoder, let $v_1$ be the value of the outputs of the upper LUT cascade, let $v_2$ be the value of the outputs of the lower LUT cascade, and let $v_{out}$ be the value of the outputs of the encoder. Then,

$$v_{out} = \begin{cases} v_1 & \text{if } v_1 \ne 0, \\ v_2 + 31 & \text{if } v_2 \ne 0, \\ 0 & \text{otherwise.} \end{cases}$$

The special encoder requires 6 slices of the FPGA. After synthesizing and mapping, r5p11 required 16 BRAMs and 42 slices. From Table IV, we can see that decreasing r increases the number of groups needed to implement the function, but decreases the number of levels in the cascade.

Next, we used the Xilinx Core Generator system to produce a Xilinx CAM with 48 inputs and 63 registered vectors. After synthesizing and mapping, Xilinx CAM required 12 BRAMs and 414 slices. Note that Xilinx CAM requires one clock cycle to find a match.

Finally, we designed the address generator Reg-Gate by using registers and gates as shown in Fig. 8. Note that the number of inputs is n = 48 and the number of outputs is $\lceil \log_2 (k+1) \rceil = \lceil \log_2 (63+1) \rceil = 6$. After synthesizing and mapping, it required 3016 slices.

V. PERFORMANCE AND COMPARISONS

In this section, we evaluate the performance of the multiple LUT cascade realizations and the multiple LUT cascade OR realizations (i.e., r3p12, r4p12, r4p12OR, r5p11, r5p11OR and r6p11), and compare them with Xilinx CAM and Reg-Gate.

TABLE V. COMPARISONS OF FPGA IMPLEMENTATIONS OF THE ADDRESS GENERATOR

 Design      Level  Slices  Memory   F_clk    tco/tpd       Th.         Th./Slice     Th./Memory    Delay
                            (BRAMs)  (MHz)    (ns)          (Mbps)      (Mbps/slice)  (Mbps/BRAM)   (ns)
 r3p12         5      87      45     107.227  21.831        643          7.37         14.30         68.461
 r4p12         6      58      24     102.638  19.301        616         10.62         25.66         77.759
 r4p12OR       6      48      24     110.205  18.203        661         13.78         27.55         72.647
 r5p11         8      42      16     127.502  19.835        765         18.21         47.81         82.579
 r5p11OR       8      41      16     136.147  19.429        817 (best)  19.92         51.06         78.189
 r6p11         9      24       9     131.822  15.142        791         32.96 (best)  87.88 (best)  83.416
 Xilinx CAM    1     414      12      74.811  29.011        449          1.08         37.41         42.378
 Reg-Gate      -    3016       -        -     36.529 (tpd)   -            -             -           36.529 (best)

In Table V, the column Level denotes the number of levels, the column Slices denotes the number of occupied slices, the column Memory denotes the amount of utilized memory, the column F_clk denotes the maximum clock rate, the column tco denotes the maximum time to obtain the output after the clock, and the column tpd denotes the maximum propagation time from the input to the output. The column Th. denotes the maximum throughput. Since the address generator has 6 outputs, it is calculated by:

$$\text{Th.} = 6 \times F_{clk}.$$

For Reg-Gate, Delay denotes the maximum delay from the input to the output and is equal to tpd. For the multiple LUT cascade realizations, the multiple LUT cascade OR realizations and Xilinx CAM, Delay denotes the total delay, and is calculated by:

$$\text{Delay} = \frac{1000 \times \text{Level}}{F_{clk}} + t_{co},$$

where 1000 is a unit conversion factor. The column Th./Slice denotes the throughput per slice, and the column Th./Memory denotes the throughput per BRAM. In Table V, the values marked (best) are the best results.

Reg-Gate has the smallest delay, but requires many slices. Note that Reg-Gate requires no clock pulses in the address generation operation, while the others are sequential circuits requiring at least one clock pulse. Since the delay of Reg-Gate is 36.529 ns, the equivalent throughput is $(1000/36.529) \times 6 = 164.25$ Mbps; this is lower than all the others. r5p11OR has the highest throughput in Table V. All of the multiple LUT cascade realizations and the multiple LUT cascade OR realizations have higher throughput and higher throughput/slice than Xilinx CAM. In terms of throughput/memory, r5p11 and r6p11 are better than Xilinx CAM. r3p12 has the smallest delay among the realizations using the multiple LUT cascades, but is slower than Xilinx CAM. r6p11 has 76% more throughput, 29.5 times more throughput/slice, and 1.35 times more throughput/memory, but 97% more delay than Xilinx CAM.
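The derived columns of Table V, and the delay-area comparison that the text turns to next, can be recomputed from the measured quantities. The Python sketch below uses the r = 6 cascade and Xilinx CAM rows; the clock rates and tco values are the table entries with the digits lost in this transcription restored so that each row satisfies the throughput and delay formulas, which is an assumption. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Implementation:
    name: str
    levels: int
    slices: int
    brams: int
    f_clk_mhz: float
    t_co_ns: float

    def throughput_mbps(self, outputs: int = 6) -> float:
        """Th. = (number of outputs) x F_clk."""
        return outputs * self.f_clk_mhz

    def delay_ns(self) -> float:
        """Delay = 1000 x Level / F_clk + tco (1000 converts MHz to ns)."""
        return 1000 * self.levels / self.f_clk_mhz + self.t_co_ns

    def area(self, alpha: float) -> float:
        """Area in slice units when one BRAM counts as alpha slices."""
        return self.slices + alpha * self.brams


def break_even_alpha(a: Implementation, b: Implementation) -> float:
    """Largest alpha for which a's delay-area product is at most b's.

    Assumes a.delay * a.brams > b.delay * b.brams, so the inequality
    delay_a * area_a(alpha) <= delay_b * area_b(alpha) bounds alpha above.
    """
    da, db = a.delay_ns(), b.delay_ns()
    return (db * b.slices - da * a.slices) / (da * a.brams - db * b.brams)


# Values assumed from Table V (restored digits, see the note above).
CASCADE_R6 = Implementation("multiple LUT cascade, r = 6", 9, 24, 9, 131.822, 15.142)
XILINX_CAM = Implementation("Xilinx CAM", 1, 414, 12, 74.811, 29.011)
```

Under these assumed values, `break_even_alpha(CASCADE_R6, XILINX_CAM)` evaluates to roughly 64.2, so the cascade has the smaller delay-area product whenever one BRAM costs no more than about 64 slices.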

It is interesting to consider the relation of delay with the area for both the utilized slices and the utilized BRAMs. Let $\alpha$ be an area factor between the BRAM and the slice, i.e., (area for one BRAM) = $\alpha$ × (area for one slice). If $\alpha \le 64$, then r6p11 is more efficient than Xilinx CAM in terms of the delay-area product.

VI. CONCLUSIONS

In this paper, we presented the multiple LUT cascade to realize address generators. We illustrated the design method by address generators with n = 48 and k = 63. However, it can be extended to any values of n and k. We implemented six kinds of address generators (i.e., r3p12, r4p12, r4p12OR, r5p11, r5p11OR and r6p11) on the Xilinx Spartan-3 FPGA (Xc3s4000-5) by using the multiple LUT cascades. For comparison, we also implemented Xilinx CAM by using the Xilinx Core Generator, and Reg-Gate by using registers and gates, on the same type of FPGA. Reg-Gate has the smallest delay, but requires many slices. All of the implementations of the multiple LUT cascades have higher throughput and higher throughput/slice than Xilinx CAM. In terms of throughput/memory, r5p11 and r6p11 are better than Xilinx CAM. r6p11 has 76% more throughput, 29.5 times more throughput/slice, and 1.35 times more throughput/memory than Xilinx CAM. In addition, if the area for one BRAM is less than or equal to the area for 64 slices, r6p11 is more efficient than Xilinx CAM in terms of the delay-area product, although it has 97% more delay than Xilinx CAM.

ACKNOWLEDGMENTS

This research is partly supported by the Grant in Aid for Scientific Research of JSPS, MEXT, and the Grant of Kitakyushu Area Innovative Cluster Project. Discussions with Prof. J. T. Butler improved the presentation of the paper.

REFERENCES

[1] M. A. Ruiz-Sanchez, E. W. Biersack, and W. Dabbous, "Survey and taxonomy of IP lookup algorithms," IEEE Network, Vol. 15, No. 2, pp. 8-23, Mar.-Apr. 2001.
[2] F. Shafai, K. J. Schultz, G. F. R. Gibson, A. G. Bluschke, and D. E. Somppi, "Fully parallel 30-MHz, 2.5-Mb CAM," IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, pp. 1690-1696, 1998.
[3] K. Pagiamtzis and A. Sheikholeslami, "A low-power content-addressable memory (CAM) using pipelined hierarchical search scheme," IEEE Journal of Solid-State Circuits, Vol. 39, No. 9, pp. 1512-1519, 2004.
[4] http://www.xilinx.com
[5] T. Sasao, M. Matsuura, and Y. Iguchi, "A cascade realization of multiple-output function for reconfigurable hardware," International Workshop on Logic and Synthesis (IWLS01), Lake Tahoe, CA, June 12-15, 2001, pp. 225-230.
[6] Xilinx, Inc., "Using block RAM for high-performance read/write CAMs," XAPP204 (v1.2), May 2000, available at http://www.xilinx.com/products/design_resources/mem_corner/resources/internal_cam.htm
[7] T. Sasao, "Design methods for multiple-valued input address generators," International Symposium on Multiple-Valued Logic, May 2006, Singapore (to be published).
[8] T. Sasao, Switching Theory for Logic Synthesis, Kluwer Academic Publishers, 1999.
[9] K. Nakamura, T. Sasao, M. Matsuura, K. Tanaka, K. Yoshizumi, H. Qin, and Y. Iguchi, "Programmable logic device with an 8-stage cascade of 64K-bit asynchronous SRAMs," Cool Chips VIII, IEEE Symposium on Low-Power and High-Speed Chips, April 20-22, 2005, Yokohama, Japan.
[10] http://www.xilinx.com/products/design_resources/design_tool/grouping/design_entry.htm
[11] Xilinx, Inc., "Content-Addressable Memory v5.1," DS253, Nov. 2004, available at http://www.xilinx.com/ipcenter/catalog/logicore/docs/cam.pdf