On the Design of LPM Address Generators Using Multiple LUT Cascades on FPGAs


November 6, 2006 1:58 International Journal of Electronics lpm IJE

International Journal of Electronics
Vol. **, No. **, ** 2006, 1–18

Hui Qin a), Tsutomu Sasao b), and Jon T. Butler c)

Department of Computer Science and Electronics, Kyushu Institute of Technology, 680-4, Kawazu, Iizuka, Fukuoka, 820-8502, Japan
Department of Electrical and Computer Engineering, Naval Postgraduate School, Code EC/Bu, Monterey, CA 93943-5121

(Received November 6, 2006)

Email: a) qinhui@aies01.cse.kyutech.ac.jp; b) sasao@cse.kyutech.ac.jp; c) jon butle@msn.com
International Journal of Electronics ISSN 0020-7217 print / ISSN 1362-3060 online, http://www.tandf.co.uk/journals, DOI:10.1080/000710xxxxxxxxxxxxx, © 2006 Taylor & Francis Ltd

We propose the multiple LUT cascade as a means to configure an n-input LPM (Longest Prefix Match) address generator commonly used in routers to determine the output port given an address. The LPM address generator accepts n-bit addresses which it matches against k stored prefixes. We implement our design on a Xilinx Spartan-3 FPGA for n = 32 and k = 504–511. Also, we compare our design to a Xilinx proprietary TCAM (ternary content-addressable memory) design and to another design we propose as a likely solution to this problem. Our best multiple LUT cascade implementation has 5.17 times more throughput, 40.71 times more throughput/area and is 2.97 times more efficient in terms of area-delay product than Xilinx's proprietary design, but its area is only 15% of Xilinx's design. Furthermore, we derive a method to determine the optimum configuration of the multiple LUT cascade on an FPGA.

Keywords: LPM address generator; Multiple LUT cascade; FPGA

1 Introduction

The need for higher internet speeds is likely to be the subject of intense interest for many years to come. A network's speed is directly related to the speed with which a node can switch a packet from an input port to an output port. This, in turn, depends on how fast a packet's address can be accessed in memory. The longest prefix match (LPM) problem is one of determining the output port address from a list of prefix vectors stored in memory. For example, if the prefix vector 01001**** is stored in memory, then the packet address 010011111 matches this entry. That is, each bit in the packet address matches exactly the corresponding bit in the prefix vector or there is a * or don't care in that position. If other stored prefixes match the packet address, then the prefix with the least don't care values determines the output port address. That is, the memory entry corresponding to the longest prefix match determines the output port.

An ideal device for this application is a ternary content-addressable memory (TCAM) (Pagiamtzis et al. 2006, Song et al. 2005). The descriptor ternary refers to the three values stored, 0, 1, and *. In (Kasnavi et al. 2005), the authors proposed pipelined TCAMs for the longest prefix match to increase TCAM efficiency. In (Wang et al. 2005), the authors used a TCAM and a small DRAM for the longest prefix match to reduce the required size of TCAM. Unfortunately, TCAM still dissipates more power than standard RAM (Renesas 2005).

Several authors have proposed the use of standard RAM in LPM design. Gupta, Lin, and McKeown showed a mechanism to perform LPM every memory access (Gupta et al. 1998). Dharmapurikar, Krishnamurthy, and Taylor have proposed the use of Bloom filters to solve the LPM problem (Dharmapurikar et al. 2003). Sasao and Butler have shown a fast, power-efficient TCAM realization using a look-up table (LUT) cascade (Sasao et al. 2006). In this paper, we propose an extension to the LUT cascade realization: a multiple LUT cascade realization that consists of multiple LUT cascades connected to a special encoder. This offers even more efficient realizations in an architecture that is more easily reconfigured when additional prefix vectors are placed in the prefix table.

We have implemented six LPM address generators on the Xilinx Spartan-3 FPGA (XC3S4000-5): four using multiple LUT cascades, one using Xilinx's TCAM realization based on the Xilinx IP core, and one using registers and gates. In addition, we compare the six LPM address generators on the basis of delay, delay-area product, throughput, throughput/area, and FPGA resources used.
A preliminary version of this paper was presented at ARC2006 (Qin et al. 2006). We extend these results by introducing the optimum configuration of the multiple LUT cascade, and by showing how to realize the optimum multiple LUT cascade on an FPGA. The rest of the paper is organized as follows: Section 2 describes the multiple LUT cascade. Section 3 shows other realizations for the LPM address generators. Section 4 presents the implementations of the LPM address generator using an FPGA. Section 5 shows the experimental results. Section 6 discusses the optimum configuration of the multiple LUT cascade implemented on an FPGA. Finally, Section 7 concludes the paper.

2 Multiple LUT cascades

2.1 LPM address generators

A content-addressable memory (CAM) (Shafai et al. 1998) stores 0's and 1's and produces the address of the given data. A TCAM, unlike a CAM, stores 0's, 1's, and *'s, where * is a don't care value that matches both 0 and 1. TCAMs are extensively used in routing tables for the internet. A routing table specifies an interface identifier corresponding to the longest prefix that matches an incoming packet, in a process called Longest Prefix Match (LPM). In the LPM table, the ternary vectors have restricted patterns: the prefix consists of only 0's and 1's, and the postfix consists of only *'s (don't cares). In this paper, this type of vector is called a prefix vector.

Definition 2.1 An n-input m-output k-entry LPM table stores k n-element prefix vectors. To assure that the longest prefix address is produced, TCAM entries are stored in descending prefix length, and the first match starting from the top of the table determines the LPM table's output. An address is an m-element binary vector for $m = \lceil \log_2(k+1) \rceil$, where $\lceil a \rceil$ denotes the smallest integer greater than or equal to a. The corresponding LPM function is a logic function $f: B^n \to B^m$, where $f(\vec{x})$ is the smallest address of an entry that is identical to $\vec{x}$ except possibly for don't care values. If no such entry exists, $f(\vec{x}) = 0^m$. The LPM address generator is a circuit that realizes the LPM function.

Example 2.2 Table 1 shows an LPM table with five 4-element prefix vectors. Table 2 shows the corresponding LPM function. It has 16 entries, one for each 4-bit input. The output address is stored for each input corresponding to the address of the longest prefix vector that matches it.

Table 1. LPM table.
Address  Prefix vector
1        1000
2        010*
3        01**
4        1***
5        0***

Table 2. LPM function.
Input  Output address    Input  Output address
0000   5                 1000   1
0001   5                 1001   4
0010   5                 1010   4
0011   5                 1011   4
0100   2                 1100   4
0101   2                 1101   4
0110   3                 1110   4
0111   3                 1111   4
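The LPM function of Definition 2.1 can be tabulated directly from an LPM table. The following is a minimal behavioural sketch in Python, not the authors' hardware description; the names `matches`, `lpm`, and `lpm_function` are hypothetical helpers introduced only for illustration. It applies the first-match rule to the prefix vectors of Table 1 and lists the output for each 4-bit input.

```python
from itertools import product

# Prefix vectors of Table 1, in descending prefix length.
# The table address of an entry is its 1-based position in this list.
TABLE_1 = ["1000", "010*", "01**", "1***", "0***"]


def matches(prefix: str, address: str) -> bool:
    """True if each address bit equals the prefix bit or the prefix bit is '*'."""
    return len(prefix) == len(address) and all(
        p in ("*", a) for p, a in zip(prefix, address)
    )


def lpm(table: list[str], address: str) -> int:
    """Smallest table address whose entry matches `address`, or 0 if none matches."""
    for position, prefix in enumerate(table, start=1):
        if matches(prefix, address):
            return position
    return 0


def lpm_function(table: list[str]) -> dict[str, int]:
    """Tabulate f for every n-bit input of the given LPM table."""
    n = len(table[0])
    inputs = ("".join(bits) for bits in product("01", repeat=n))
    return {x: lpm(table, x) for x in inputs}


if __name__ == "__main__":
    for address, output in lpm_function(TABLE_1).items():
        print(address, output)
```

Because the entries are stored in descending prefix length, returning the first match is the same as returning the longest matching prefix.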
2.2 An LUT cascade realization of LPM address generators

An LPM function, such as that shown in Table 2, can be realized by a single memory which operates as a programmable combinational logic circuit.

However, this often requires prohibitively large memory size.

Theorem 2.3 (Sasao 2006) An n-input LPM address generator with k prefix vectors can be realized by an LUT cascade, where each cell realizes a p-input, r-output combinational logic function. Let s be the necessary number of levels or cells. Then,

$$ s \le \left\lceil \frac{n - r}{p - r} \right\rceil, \qquad (1) $$

where $p > r$ and $r = \lceil \log_2(k+1) \rceil$.

2.3 LPM address generators using the multiple LUT cascade

A single LUT cascade realization of an LPM function often requires many levels. Since the delay is proportional to the number of levels in a cascade, we wish to reduce the number of levels. According to (1), if we increase p, the number of inputs to each cell, then the number of levels s is reduced. For each increase by 1 of p, the memory needed to realize the cell is doubled. However, as shown in Figure 1, we can use a multiple LUT cascade to reduce the number of levels s while keeping p fixed.

For an n-input LPM function with k prefix vectors, let the number of rails of each LUT cascade be r. First, starting at the top of the LPM table, partition the set of prefix vectors into g groups of $2^r - 1$ vectors each, except the last group, which has $2^r - 1$ or fewer vectors, where $g = \lceil k / (2^r - 1) \rceil$. For each group of prefix vectors, form an independent LPM function. Next, partition the set of n inputs into s groups. The inputs within a group will apply to a single cell within each cascade. Then, realize each LPM function by an LUT cascade. Thus, we need a total of g LUT cascades, where each LUT cascade consists of s cells. Finally, use a special encoder to produce the LPM address.

Let $v_i$ ($i = 1, 2, \ldots, g$) be the i-th input of the special encoder from the i-th LUT cascade, and let $v_{out}$ be the output value of the special encoder. That is, $v_i$ is the output value of the i-th LUT cascade, where its binary output values are viewed as a standard binary number. Similarly, $v_{out}$ is the output of the special encoder, where its binary output values are viewed as a standard binary number. Then, we have the relation:

$$ v_{out} = \begin{cases} v_i + (i-1)(2^r - 1) & \text{if } v_i \neq 0 \text{ and } v_j = 0 \text{ for all } 1 \le j \le i-1, \\ 0 & \text{if } v_i = 0 \text{ for all } 1 \le i \le g. \end{cases} $$

Note that $v_{out}$ is the position of a prefix vector v in the complete LPM table, while i is the index to the LUT cascade storing v. $(i-1)(2^r - 1)$ is the position in the LPM table of the last entry of the previous $(i-1)$-th LUT cascade or is 0 in the case of the first LUT cascade. Adding $v_i$ to this yields the position of v in the complete LPM table.

Example 2.4 Consider an n-input LPM function with k prefix vectors. When k = 1000 and n = 32, by Theorem 2.3, we have r = 10. Let p = r + 1 = 11. When we use a single LUT cascade to realize the function, by Theorem 2.3, we need $\lceil (n-r)/(p-r) \rceil = 22$ cells, and the number of levels of the LUT cascade is also 22. Since each cell has 11 address lines and 10 outputs, the total memory size needed to realize the cascade is $2^{11} \times 10 \times 22 = 450,560$ bits. Note that the memory size of each cell, $2^{11} \times 10 = 20,480$ bits, is too large to be realized by a single block RAM (BRAM) of our FPGA, which stores 18,432 bits.

However, if we use a multiple LUT cascade to realize the function, we can reduce the number of levels and the total memory. Also, the cells will fit into the BRAMs in the FPGAs. Partition the set of vectors into two groups, and realize each group independently; this requires two LUT cascades. For each LUT cascade, the number of vectors is 500, so we have r = 9. Also, let p = r + 2 = 11. Then, we need $\lceil (n-r)/(p-r) \rceil = 12$ cells in each cascade. Note that the number of levels of the LUT cascades is 12, which is smaller than the 22 needed in the single LUT cascade realization. Since each cell consists of a memory with 9 outputs and at most 11 address lines, the total memory size is at most $2^{11} \times 9 \times 12 \times 2 = 442,368$ bits. Also, note that the size of the memory for a single cell is $2^{11} \times 9 = 18,432$ bits. This fits exactly in the BRAMs of the FPGAs. Thus, the multiple LUT cascade not only reduces the number of levels and the total memory, but also reduces the size of cells to fit into the available memory in the FPGAs.

Fig. 1 shows the multiple LUT cascade realization. It consists of multiple LUT cascades and a special encoder. The inputs of each LUT cascade are common with other LUT cascades, while the outputs of each LUT cascade are connected to the special encoder. Each LUT cascade realizes an LPM function, while the special encoder generates the LPM address from the outputs of cascades.

Figure 1. Architecture of the multiple LUT cascade.

The detailed design of each LUT cascade is shown in Fig. 2. Here $x_i$ ($i = 1, 2, \ldots, s$) denotes the primary inputs to the i-th cell, $d_i$ ($i = 1, 2, \ldots, s$) denotes the data inputs to the i-th cell and provides the data value to be written in the RAM of the i-th cell, r denotes the number of rails, where $r \ge \log_2(k/g + 1)$, and $c_j$ ($j = 2, 3, \ldots, s$) denotes the additional inputs to the j-th cell and is used to select the RAM location along with $x_j$ for write access. Note that $c_j$ and $d_i$ are represented by r bits. All RAMs except perhaps the last one have p address lines; the last RAM has at most p address lines. When WE is high, $c_j$ is connected to the RAM through a MUX, allowing data to be written into the RAMs. When WE is low, the outputs of the RAMs are connected to the inputs of the succeeding RAMs through a MUX, and the circuit is a cascade that realizes the LPM function. Note that the RAMs are synchronous RAMs. Therefore, the LUT cascade resembles a shift register.

Figure 2. Detailed design of the LUT cascade.

Example 2.5 Table 3 shows a 6-input 3-output 6-entry LPM table, and the truth table of the corresponding LPM function is shown in Table 4. Note that the entries in the two tables are similar. Table 4 is a compact truth table, showing only non-zero outputs. Its input combinations must be disjoint. Thus, the two tables are the same except for three entries.

Table 3. 6-entry LPM table.
Address  Prefix vector
1        100000
2        10010*
3        1010**
4        101***
5        10****
6        1*****

Table 4. Truth table for the corresponding LPM function.
Input                     Output               LUT cascade
x1 x2 x3 x4 x5 x6         out2 out1 out0
1  0  0  0  0  0          0    0    1          Upper (Cells 1 and 2)
1  0  0  1  0  *          0    1    0          Upper (Cells 1 and 2)
1  0  1  0  *  *          0    1    1          Upper (Cells 1 and 2)
1  0  1  1  *  *          1    0    0          Lower (Cells 3 and 4)
1  0  0  *  *  *          1    0    1          Lower (Cells 3 and 4)
1  1  *  *  *  *          1    1    0          Lower (Cells 3 and 4)
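The special encoder relation of Section 2.3 can be exercised on Table 3, split into an upper and a lower group of three vectors as indicated in Table 4. The sketch below is a behavioural model only: each LUT cascade is modelled as a first-match search over its own group rather than as a chain of memory cells, and the function names (`lpm`, `special_encoder`, `multiple_cascade`) are hypothetical, chosen for this illustration.

```python
from itertools import product

# Table 3, in descending prefix length; the table address is the 1-based position.
TABLE_3 = ["100000", "10010*", "1010**", "101***", "10****", "1*****"]


def lpm(table, address):
    """First-match LPM: smallest 1-based position whose prefix matches, else 0."""
    for position, prefix in enumerate(table, start=1):
        if all(p in ("*", a) for p, a in zip(prefix, address)):
            return position
    return 0


def special_encoder(v, r):
    """v[i-1] holds the output v_i of the i-th cascade; returns v_out."""
    for i, v_i in enumerate(v, start=1):
        if v_i != 0:  # the first cascade reporting a match determines the output
            return v_i + (i - 1) * (2**r - 1)
    return 0


def multiple_cascade(table, r, address):
    """Split the table into groups of 2^r - 1 vectors and combine the group outputs."""
    size = 2**r - 1
    groups = [table[j:j + size] for j in range(0, len(table), size)]
    return special_encoder([lpm(group, address) for group in groups], r)


if __name__ == "__main__":
    inputs = ["".join(bits) for bits in product("01", repeat=6)]
    agree = all(multiple_cascade(TABLE_3, 2, x) == lpm(TABLE_3, x) for x in inputs)
    print("two cascades with r = 2 agree with the full table:", agree)
```

With r = 2 each group holds at most three vectors, so a non-zero output of the lower cascade is offset by 3 to give its position in the complete table.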

Single Memory Realization: The number of address lines is 6, and the number of outputs is 3. Thus, the total amount of memory is $2^6 \times 3 = 192$ bits.

Single LUT Cascade Realization: Since there are k = 6 prefix vectors, by Theorem 2.3, the number of rails is $r = \lceil \log_2(6+1) \rceil = 3$. Let the number of address lines for the memory in a cell be p = 4. By partitioning the inputs into three disjoint sets $\{x_1, x_2, x_3, x_4\}$, $\{x_5\}$, and $\{x_6\}$, we have the cascade in Fig. 3 (a). For simplicity, only the signal lines for the cascade realization are shown. Other lines, such as for storing data, are omitted. The total amount of memory is $2^4 \times 3 \times 3 = 144$ bits, and the number of levels is s = 3. Note that the single LUT cascade requires 75% of the memory needed in the single memory realization.

Multiple LUT Cascade Realization: Partition Table 3 into two parts, each with three prefix vectors. The number of rails in the LUT cascades associated with each separate LPM table is $\lceil \log_2(3+1) \rceil = 2$. Let the number of address lines for the memory in a cell be p = 4. By partitioning the inputs into two disjoint sets $\{x_1, x_2, x_3, x_4\}$ and $\{x_5, x_6\}$, we obtain the realization in Fig. 3 (b). The upper LUT cascade realizes the upper part of Table 4, while the lower LUT cascade realizes the lower part of Table 4. The contents of each cell are shown in Table 5.

Figure 3. Single LUT cascade realization and the multiple LUT cascade realization. (a) Single LUT cascade realization. (b) Multiple LUT cascade realization.

Table 5. Truth tables for the cells in the multiple LUT cascade realization.

Cells 1 and 2 (upper LUT cascade)
x1 x2 x3 x4    y1 y2    x5 x6    z1 z2    v1    v_out
1  0  0  0     0  0     0  0     0  1     1     001
1  0  0  1     0  1     0  *     1  0     2     010
1  0  1  0     1  0     *  *     1  1     3     011
Other values   1  1     *  *     0  0     0     †
               Other values      0  0     0     †

Cells 3 and 4 (lower LUT cascade)
x1 x2 x3 x4    y3 y4    x5 x6    z3 z4    v2    v_out
1  0  1  1     0  0     *  *     0  1     1     100
1  0  0  *     0  1     *  *     1  0     2     101
1  1  *  *     1  0     *  *     1  1     3     110
Other values   1  1     *  *     0  0     0     †
               Other values      0  0     0     †

† Depends on values from the other LUT cascade.

Let $v_1$ be the output value of the upper LUT cascade, let $v_2$ be the output value of the lower LUT cascade, and let $v_{out}$ be the output value of the special encoder. Then, in Table 5, $(z_1, z_2)$ viewed as a standard binary number has value $v_1$, while $(z_3, z_4)$ viewed as a standard binary number has value $v_2$. The special encoder generates the LPM address from the pair of outputs, $(z_1, z_2)$ and $(z_3, z_4)$:

$$ out_2 = \bar{z}_1 \bar{z}_2 (z_3 \lor z_4), $$
$$ out_1 = z_1 \lor \bar{z}_2 z_3 z_4, $$
$$ out_0 = z_2 \lor \bar{z}_1 z_3 \bar{z}_4. $$

Note that $(out_2, out_1, out_0)$ viewed as a standard binary number has value $v_{out}$ corresponding to the address in Table 3. The total memory size is $2^4 \times 2 \times 4 = 128$ bits, and the number of levels is 2. Note that the multiple LUT cascade realization requires 89% of the memory and one fewer level than the single LUT cascade realization.

3 Other realizations

3.1 Xilinx's TCAM

Xilinx (Website of Xilinx) provides a proprietary realization of a TCAM that is produced by the Xilinx CORE Generator tool. Since a TCAM can directly realize an LPM address generator, we compare our proposed multiple LUT cascade realization with Xilinx's TCAM. In the Xilinx CORE Generator 7.1i, we used the following parameters to produce TCAMs. Implementation: SRL16. Mode: Standard ternary mode to generate a standard ternary CAM. Depth: k, the number of words or vectors stored in the TCAM. Data width: n, the number of bits in words or vectors. Match Address Type: Binary encoded. Address Resolution: Lowest.

3.2 Registers and gates

We also compare our proposed multiple LUT cascade realization with a direct realization using registers and gates, as shown in Fig. 4. We use a register pair (Reg. 1 and Reg. 0) to store each digit of a ternary vector. For example, if the digit is * (don't care), the register pair stores (1,1). Thus, for n bit data, we need a 2n-bit register. The comparison circuit consists of an n-input AND gate and n 1-bit comparison circuits, each of which produces a 1 if and only if the input bit matches the stored bit or the stored bit is don't care (* or 11). For each prefix vector of an n-input LPM address generator, we need a 2n-bit register, n 1-bit comparison circuits, and an n-input AND gate. For an n-input address generator with k registered prefix vectors, we need k 2n-bit registers, nk 1-bit comparison circuits, and k n-input AND gates. In addition, we need a priority encoder with k inputs and $\lceil \log_2(k+1) \rceil$ outputs to generate the LPM address. If the n-input AND gate is realized as a cascade of 2-input AND gates, this circuit can be considered as a special case of the multiple LUT cascade architecture, where r = 1, p = 2, and g = k. Note that the output encoder circuit is a standard priority encoder.

Figure 4. Realization of the address generator with registers and gates. Each ternary digit is stored in a register pair (Reg.1, Reg.0): 0 as (0, 1), 1 as (1, 0), and * as (1, 1).

4 FPGA implementations

We implemented the LPM address generators for 32 inputs and 504–511 registered prefix vectors on Xilinx Spartan-3 FPGAs (XC3S4000-5) in three ways: the multiple LUT cascade, Xilinx CORE Generator 7.1i, and registers and gates. XC3S4000-5 (Spartan-3 FPGA data sheet 2005) has 96 BRAMs and 27,648 slices. Each BRAM contains 18K bits, and each slice consists of two 4-input LUTs, two D-type flip-flops, and multiplexers. For each implementation, we described the circuit by Verilog HDL, and then used Xilinx ISE 7.1i to synthesize and to perform place and route.

First, we used the multiple LUT cascade to realize the LPM address generators. To use the BRAMs in the FPGA efficiently, we chose the memory size of a cell in the LUT cascade not to exceed the size of a BRAM unit. Let p be the number of address lines of the memory in the cell. Since each BRAM contains $2^{11} \times 9$ bits, we have the relation:

$$ 2^p \times r \le 2^{11} \times 9, $$

where r is the number of rails. Thus, we have

$$ p = \lfloor \log_2(9/r) \rfloor + 11, $$

where $\lfloor a \rfloor$ denotes the largest integer less than or equal to a.
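Once r is chosen, the relation for p above, the group count g of Section 2.3, and the bound (1) of Theorem 2.3 determine the shape of a multiple LUT cascade. The following Python sketch combines them; it is an illustration under stated assumptions, not the authors' design flow. The function names are hypothetical, the level count is taken to be the upper bound of (1), and the (k, r) pairs in the usage loop are those of the four designs listed in Table 6.

```python
BRAM_BITS = 2**11 * 9  # one Spartan-3 BRAM stores 18,432 bits


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def rails_for(k):
    """r = ceil(log2(k + 1)), which equals k.bit_length() for k >= 1."""
    return k.bit_length()


def address_lines(r, cell_bits=BRAM_BITS):
    """Largest p with 2^p * r <= cell_bits; for a BRAM this is floor(log2(9/r)) + 11."""
    p = 0
    while 2 ** (p + 1) * r <= cell_bits:
        p += 1
    return p


def configure(n, k, r):
    """Return (p, groups, levels, cells) for an n-input, k-vector design with r rails."""
    p = address_lines(r)
    groups = ceil_div(k, 2**r - 1)   # g = ceil(k / (2^r - 1))
    levels = ceil_div(n - r, p - r)  # upper bound (1) of Theorem 2.3
    return p, groups, levels, groups * levels


if __name__ == "__main__":
    for k, r in [(504, 6), (508, 7), (510, 8), (511, 9)]:
        print(f"r={r} k={k} (p, groups, levels, cells) =", configure(32, k, r))
```

The same helper reproduces the second half of Example 2.4: for n = 32, k = 1000 and r = 9 it gives two cascades of 12 cells, that is 24 cells of 18,432 bits each.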

We designed four LPM address generators r6p11, r7p11, r8p11, and r9p11, as shown in Table 6, where the column Number of prefix vectors denotes the number of registered prefix vectors, the column r denotes the number of rails, the column p denotes the number of address lines of the RAM in a cell, the column Group denotes the number of LUT cascades, and the column Level denotes the number of levels or cells in the LUT cascade.

Table 6. Four multiple LUT cascade realizations.
Design   Number of prefix vectors   r   p    Group   Level
r6p11    504                        6   11   8       6
r7p11    508                        7   11   4       7
r8p11    510                        8   11   2       8
r9p11    511                        9   11   1       12
r: Number of rails. p: Number of address lines of the RAM in a cell. Group: Number of LUT cascades.

To explain Table 6, consider r8p11 which is shown in Fig. 5. For r8p11, since the number of rails is r = 8, the number of groups is $\lceil 510 / (2^8 - 1) \rceil = 2$. Thus, we need two LUT cascades. Since each LUT cascade consists of 8 cells, the number of levels of r8p11 is 8. To efficiently use BRAMs in the FPGA, the number of address lines of the RAM in the cell is set to $p = \lfloor \log_2(9/8) \rfloor + 11 = 11$. Let $v_1$ be the value of the outputs of the upper LUT cascade, let $v_2$ be the value of the outputs of the lower LUT cascade, and let $v_{out}$ be the value of the outputs of the special encoder. Then, we have the relation:

$$ v_{out} = \begin{cases} v_2 + 255 & \text{if } v_1 = 0 \text{ and } v_2 \neq 0, \\ v_1 & \text{otherwise.} \end{cases} $$

The circuit realizing this expression requires 11 slices on the FPGA. For the whole circuit, r8p11 requires 16 BRAMs and 69 slices. From this table, we can see that decreasing r increases the number of groups, but decreases the number of levels.

Figure 5. Realization of r8p11.

Next, we used the Xilinx CORE Generator 7.1i tool to produce Xilinx's TCAM. Since the Xilinx CORE Generator 7.1i does not support TCAMs with 32 inputs and 505–511 registered prefix vectors, we designed a TCAM with 32 inputs and 504 registered prefix vectors. The resulting TCAM required 8,590 slices. Note that Xilinx's TCAM requires one clock cycle to find a match.

Finally, we designed the LPM address generator with n = 32 inputs and k = 511 registered prefix vectors using registers and gates, as shown in Fig. 4. This design is denoted Reg-Gates. Note that the number of inputs is 32 and the number of outputs is 9. This design required 27,646 slices.

5 Performance and comparisons

In Table 7, we show the performance of multiple LUT cascade realizations (i.e., r6p11, r7p11, r8p11, and r9p11), and compare them with Xilinx's TCAM and Reg-Gates. In Table 7, the column Level denotes the number of levels or cells in the LUT cascade, the column Slice denotes the number of occupied slices, the column Memory denotes the amount of memory required, and the column F_clk denotes the maximum clock frequency. The column tco denotes the maximum clock-to-output propagation delay. (It is the maximum time required to obtain a valid output at the output pin that is fed by a register after a clock signal transition on an input pin that clocks the register.) The column tpd denotes the maximum propagation time from the inputs to the outputs. The column Th. denotes the maximum throughput. Since the LPM address generator has 9 outputs, it is calculated as:

$$ \text{Th.} = 9 \times F_{clk}. $$

For Reg-Gates, Delay denotes the maximum delay from the input to the output and is equal to tpd. For multiple LUT cascade realizations and Xilinx's TCAM, Delay denotes the total delay, and is calculated by:

$$ \text{Delay} = \frac{1000 \times \text{Level}}{F_{clk}} + t_{co}, $$

where 1000 is a unit conversion factor.

Table 7. Comparisons of FPGA implementations of the LPM address generator.
Design          Level  Slice   Memory  F_clk   tco/tpd      Th.          Area a)       Th./Area       Delay         Area-Delay
                               (BRAM)  (MHz)   (ns)         (Mbps)       (slice)       (Mbps/slice)   (ns)          (slice-µs)
r6p11           6      178     48      103.89  24.89 (tco)  935          4786          0.195          82.64         395.53
r7p11           7      116     28      113.77  23.46 (tco)  1024         2804          0.365          84.99         238.31
r8p11           8      69      16      139.93  0.1 (tco)    1259 (best)  1605          0.785          79.57         127.71
r9p11           12     99      12      139.08  13.72 (tco)  1252         1251 (best)   1.001 (best)   100.00        125.10 (best)
Xilinx's TCAM   1      8590    -       22.52   13.48 (tco)  203          8590          0.024          57.88 (best)  497.23
Reg-Gates       -      27646   -       -       58.67 (tpd)  -            27646         -              58.67         1621.99
a) We assume that the area for one BRAM is equivalent to the area of 96 slices.

Consider the area occupied by the various realizations. From the Spartan-3 family architecture (Spartan-3 FPGA data sheet 2005), we can see that the area of one BRAM is at least the area of 16 slices (a slice consists of two 4-input LUTs, two flip-flops, and miscellaneous multiplexers). An alternative estimate shows that the area of one BRAM is equivalent to that of 96 slices, as follows. In the Xilinx Virtex-II FPGA, one 4-input LUT occupies approximately the same area as 96 bits of BRAM (also containing 18K bits) (Sproull et al. 2005). Note that both 4-input LUTs and BRAMs of the Virtex-II FPGA are similar to those of the Spartan-3 FPGA. Thus, we can deduce that one BRAM of the Spartan-3 FPGA occupies about the same area as 192 (= 18 × 1024/96) 4-input LUTs. If we view one 4-input LUT as approximately one-half a slice according to our discussion in the previous paragraph, we conclude that one BRAM has about the same area as 96 (= 192/2) slices. Thus, the two estimates of the area for one BRAM, 16 and 96 slices, are quite different. For this analysis, a worst case of 96 slices/BRAM was used.

In Table 7, the column Area denotes the equivalent utilized area, where the area for one BRAM is equivalent to the area for 96 slices. The column Th./Area denotes the efficiency of throughput per area for one slice. The column Area-Delay denotes the area-delay product. The value denoted by best shows the best result.

Xilinx's TCAM has the smallest delay, but requires many slices. Reg-Gates has almost the same delay as Xilinx's TCAM, but requires about three times as many slices as Xilinx's TCAM. Note that Reg-Gates requires no clock pulses in the LPM address generation operation, while the others are sequential circuits that require clock pulses. Since the delay of Reg-Gates is 58.67 ns, the equivalent throughput is 9 × (1000/58.67) = 153 (Mbps), which is lower than all others.
All multiple LUT cascade realizations have higher throughput, smaller area, and higher throughput/area than Xilinx's TCAM, and are more efficient in terms of area-delay. 8p11 has the highest throughput and the smallest delay among all multiple LUT cascade realizations, but is slightly less efficient in terms of area-delay than 9p11. 9p11 has the smallest area, the highest throughput/area, and the highest efficiency in terms of area-delay among all realizations. Although 8p11 has almost the same area-delay as 9p11, its area is 28% larger than that of 9p11. Hence, 9p11 is the best multiple LUT cascade realization, since it has 5.17 times more throughput, 40.71 times more throughput/area, and is 2.97 times more efficient in terms of area-delay product than Xilinx's TCAM, while its area is only 15% of that of Xilinx's TCAM.
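The comparison figures in the preceding paragraph can be reproduced with a few lines of Python. This is a hedged sketch: it assumes that "x times more" is meant as the relative increase (a − b)/b, and it takes the Table 7 entries for 9p11 and Xilinx's TCAM as 1252 and 203 Mbps, 1.001 and 0.024 Mbps/slice, 1251 and 8590 slices, and 125.10 and 497.23 slice-µs. The helper name `times_more` is hypothetical.

```python
def times_more(a: float, b: float) -> float:
    """Relative increase of a over b, i.e. (a - b) / b (assumed reading of 'times more')."""
    return a / b - 1.0


# Table 7 entries as read for this sketch (9p11 versus Xilinx's TCAM).
P9 = {"th": 1252.0, "th_per_area": 1.001, "area": 1251.0, "area_delay": 125.10}
TCAM = {"th": 203.0, "th_per_area": 0.024, "area": 8590.0, "area_delay": 497.23}

throughput_gain = times_more(P9["th"], TCAM["th"])
th_per_area_gain = times_more(P9["th_per_area"], TCAM["th_per_area"])
# A smaller area-delay product is better, so the TCAM value goes in the numerator.
area_delay_gain = times_more(TCAM["area_delay"], P9["area_delay"])
area_fraction = P9["area"] / TCAM["area"]

if __name__ == "__main__":
    print(round(throughput_gain, 2))
    print(round(th_per_area_gain, 2))
    print(round(area_delay_gain, 2))
    print(round(100 * area_fraction))
```

Under these assumptions the four printed values are 5.17, 40.71, 2.97, and 15, matching the figures quoted in the text.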

Table 8. Memory-Delay for multiple LUT cascade realizations.

Design                    6p11     7p11     8p11     9p11
Area-Delay (slice-µs)     395.53   238.31   127.71   125.10
Memory-Delay (BRAM-µs)    3.97     2.38     1.27     1.20

6 The optimum configuration of the multiple LUT cascade

Firstly, consider the relation between the required memory size and the total area. As can be seen from Table 7, for 6p11, which has the most complicated encoder, the memory required occupies 96.3% (= 48 × 96/4786) of the total area. For 9p11, the memory required occupies 92.1% (= 12 × 96/1251) of the total area. Note that 9p11 has the smallest proportion of memory area to total area among all the multiple LUT cascade realizations. Thus, the memory consumes no less than 92% of the total area. In addition, as shown in Table 7, the size of Memory is approximately proportional to the Area. Hence, we can assume that the multiple LUT cascade realization with the smallest memory size corresponds to that with the smallest total area.

Secondly, consider the relation between the Area-Delay and Memory-Delay products. As shown in Table 8, 9p11 has both the smallest Area-Delay and the smallest Memory-Delay among all multiple LUT cascade realizations. Note that the value of Memory-Delay is approximately proportional to the Area-Delay. Thus, we can assume that the realization with the smallest Memory-Delay corresponds to that with the smallest Area-Delay.

Therefore, we can use the total size of memory required instead of the total area, and Memory-Delay instead of Area-Delay, to find the optimum multiple LUT cascade realization. Doing this allows a formal analysis, as shown in the next section.

6.1 Total size of memory

Consider the multiple LUT cascade implementation of an n-input LPM address generator that stores k prefix vectors. Let each cell be realized as a reconfigurable memory with m bits. For the implementations discussed previously in this paper, this memory is a BRAM of the Spartan-3 FPGA, where m = 18,432 bits.
Each cell in the LUT cascade has $r$ outputs, where $r \le \lceil \log_2 (k+1) \rceil$. With $m$ bits stored in each memory and $r$ bits per word, $m/r$ words are stored in each LUT cell. Therefore, the number of address inputs for each LUT cell is $p(r) = \lfloor \log_2 (m/r) \rfloor$. Note that $r \le p(r) - 1$. Let $M(r)$ be the total memory needed to implement the given LPM address generator. That is,

$$M(r) = m\,s\,g, \qquad (2)$$

where $s = \lceil (n-r)/(p(r)-r) \rceil$ is the number of cells in each of the $g = \lceil k/(2^r-1) \rceil$ cascades that make up the multiple LUT cascade realization of the LPM address generator.

Theorem 6.1 $M(r)$ is a monotone decreasing function of $r$ for $r \le p(r) - 2$.

Since $M(r)$ is monotone decreasing for $r \le p(r) - 2$, to find the minimum $M(r)$, it is only necessary to find $M(r)$ for $r = p(r) - 2$ and $r = p(r) - 1$, an upper bound on $r$.

6.2 Memory-delay product

From Table 7, we observed that the delay in an n-input LPM address generator is given approximately as

$$D = \frac{1000}{F_{clk}}\,(s+2), \qquad (3)$$

where $F_{clk}$ is the frequency of the clock in MHz. Let $MD(r)$ be the memory-delay product of the multiple LUT cascade realization of the address generator. Therefore,

$$MD(r) = (m\,s\,g)\left(\frac{1000}{F_{clk}}\,(s+2)\right). \qquad (4)$$

Theorem 6.2 $MD(r)$ is a monotone decreasing function of $r$ for $r \le p(r) - 5$. Specially, when $p(r-1) = p(r)$, $MD(r)$ is a monotone decreasing function of $r$ for $r \le p(r) - 3$.

Since $MD(r)$ is monotone decreasing for $r \le p(r) - 5$, to find the minimum $MD(r)$, it is only necessary to find $MD(r)$ for $r = p(r) - i$, where $i = 1, 2, \ldots, 5$, that is, for five values of $r$. Specially, when $p(r-1) = p(r)$, to find the minimum $MD(r)$, we only need to consider three cases, $r = p(r) - i$, where $i = 1$, 2, and 3.

6.3 Optimum multiple LUT cascade for the BRAM containing 18K bits, 16K bits or 4K bits

In popular FPGAs, such as Xilinx's FPGAs or Altera's FPGAs, the sizes of BRAM are 18K bits or 4K bits. For other FPGAs, the size of BRAM can be 16K bits. We consider these three types of BRAMs in the following discussion.

For 16K-bit BRAM, from $p(r) = \lfloor \log_2 (16384/r) \rfloor$, we have $p(r) = 10$ for $r = 9$ and $p(r) = 11$ for $5 \le r \le 8$. Since $p(r-1) = p(r) = 11$ for $5 \le r \le 8$, from Theorem 6.2, $MD(r)$ decreases with $r$ when $r \le 11 - 3 = 8$. Let $\zeta(r) = s\,g\,(s+2)$, where $s = \lceil (n-r)/(p(r)-r) \rceil$ and $g = \lceil k/(2^r-1) \rceil$. We can verify $\zeta(8) < \zeta(9)$ when $n > 15$. In most applications, we can assume that $n > 16$. Thus, we can conclude that $MD(r)$ is minimum when $r$ is maximum, where $r \le 8$.

When the size of a BRAM is m = 18K bits, from $p(r) = \lfloor \log_2 (9/r) \rfloor + 11$, we have $p(r) = 11$ for $5 \le r \le 9$. When m = 4K bits, from $p(r) = \lfloor \log_2 (8/r) \rfloor + 9$, we have $p(r) = 9$ for $5 \le r \le 8$ and $p(r) = 10$ for $r = 4$. Thus, for both m = 18K bits and m = 4K bits, we have $p(r-1) = p(r)$ when $r \ge p(r) - 5$. From Theorem 6.2, $MD(r)$ is minimum when $r = 11 - 2 = 9$ or $r = 11 - 3 = 8$ for 18K-bit BRAM, and $r = 9 - 1 = 8$, $r = 9 - 2 = 7$, or $r = 9 - 3 = 6$ for 4K-bit BRAM. We can verify $\zeta(6) < \zeta(8)$ for 4K-bit BRAM when $n > 14$. In most applications, we can assume that $n > 16$. Thus, we only need to consider the case of $r \le p(r) - 2$. Depending on the values of $n$ and $k$, $MD(r)$ is minimum when $r = p(r) - 2$ or $r = p(r) - 3$. However, for an LPM address generator with fixed $n$ and $k$, we can easily obtain an $r$ that minimizes $\zeta(r) = s\,g\,(s+2)$ by calculating the values for $r = p(r) - 2$ and $r = p(r) - 3$. Table 9 shows the values of $r$ that minimize $MD(r)$ for three types of BRAMs. In Table 7, the Area-Delay values for 9p11 and 8p11 are nearly the same. Note that when $p = 11$ and $n = 32$, $\zeta(p-2) = 0.30$ and $\zeta(p-3) = 0.31$ are almost the same value.

Table 9. The value of r that makes MD(r) minimum.

Block RAM size   The minimum Memory-Delay
18K bits         r = r_max when r_max ≤ 8; r = r_optimal when r_max = 9
4K bits          r = r_max when r_max ≤ 6; r = r_optimal when r_max = 7
16K bits         r = r_max for r ≤ 8

r_max is the maximum integer $r$ that satisfies both $r \le p(r) - 1$ and $r \le \lceil \log_2 (k+1) \rceil$, where $p(r) = \lfloor \log_2 (m/r) \rfloor$ and $m$ denotes the size of a BRAM. r_optimal is the $r$ that makes $s\,g\,(s+2)$ minimum, where $s = \lceil (n-r)/(p(r)-r) \rceil$ and $g = \lceil k/(2^r-1) \rceil$. For m = 18K-bit BRAM, r_optimal can be obtained by calculating values only for $r = p(r) - 2 = 9$ and $r = p(r) - 3 = 8$. For m = 4K-bit BRAM, r_optimal can be obtained by calculating values only for $r = p(r) - 2 = 7$ and $r = p(r) - 3 = 6$.
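The quantities used in Sections 6.1 to 6.3 can be evaluated directly for given n, k, m, and r. The following Python sketch is a minimal illustration, assuming the definitions $p(r) = \lfloor \log_2 (m/r) \rfloor$, $s = \lceil (n-r)/(p(r)-r) \rceil$, $g = \lceil k/(2^r-1) \rceil$, $M(r) = m\,s\,g$, and $\zeta(r) = s\,g\,(s+2)$. All function names are hypothetical, and n = 32, k = 2040 are used only as sample inputs.

```python
def address_inputs(r: int, m: int) -> int:
    """p(r) = floor(log2(m / r)), computed with integers to avoid rounding errors."""
    return (m // r).bit_length() - 1


def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def cells_per_cascade(n: int, r: int, m: int) -> int:
    """s: number of cells (levels) in one LUT cascade."""
    p = address_inputs(r, m)
    return ceil_div(n - r, p - r)


def num_cascades(k: int, r: int) -> int:
    """g: number of cascades, each storing up to 2^r - 1 prefix vectors."""
    return ceil_div(k, 2 ** r - 1)


def total_memory_bits(n: int, k: int, r: int, m: int) -> int:
    """M(r) = m * s * g."""
    return m * cells_per_cascade(n, r, m) * num_cascades(k, r)


def zeta(n: int, k: int, r: int, m: int) -> int:
    """zeta(r) = s * g * (s + 2), proportional to MD(r) for fixed m and Fclk."""
    s = cells_per_cascade(n, r, m)
    g = num_cascades(k, r)
    return s * g * (s + 2)


if __name__ == "__main__":
    m = 18 * 1024                      # 18K-bit BRAM
    for r in (8, 9):
        brams = total_memory_bits(32, 2040, r, m) // m
        print(r, address_inputs(r, m), brams, zeta(32, 2040, r, m))
```

Because MD(r) in Eq. (4) equals $\zeta(r)$ times the factor $1000\,m/F_{clk}$, comparing $\zeta$ values is enough to rank candidate values of $r$ when $m$ and $F_{clk}$ are treated as fixed.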
For BRAM containing 18K bits, Theorem 6.1 shows that the area for $r = 9$ is smaller than for $r = 8$. If $MD(r)$ for $r = 9$ is almost the same as for $r = 8$, then the multiple LUT cascade is optimum when $r = 9$. Similarly, for 4K-bit BRAM, if $MD(r)$ for $r = 7$ and $r = 6$ are almost the same, then the multiple LUT cascade is optimum when $r = 7$. The following example shows the design of an optimum multiple LUT cascade.

[Figure 6. Optimum realization of 9p11g4: four LUT cascades of 12 cells each, with outputs v_1 to v_4, feeding a special encoder that produces v_out.]

Example 6.3 Consider an LPM address generator with n = 32 and k = 2040 implemented on a Spartan-3 FPGA. Note that the size of a BRAM is m = 18K bits. First, from $p(r) = \lfloor \log_2 (9/r) \rfloor + 11$, we have $p(r) = 11$ when $5 \le r \le 9$. To obtain the optimal Area-Delay realization, from Table 9, $r$ can be 8 or 9 when n = 32 and k = 2040. Let $\zeta(r) = s\,g\,(s+2)$. We have $\zeta(9) = 672$ and $\zeta(8) = 640$. Note that $\zeta(9)$ is nearly the same as $\zeta(8)$. Since the area for $r = 9$ is minimum from Theorem 6.1, the multiple LUT cascade is optimum when $p = 11$ and $r = 9$. In this case, a realization with $g = \lceil k/(2^r-1) \rceil = \lceil 2040/(2^9-1) \rceil = 4$ LUT cascades is optimum. Also, the number of levels is $\lceil (n-r)/(p-r) \rceil = \lceil (32-9)/(11-9) \rceil = 12$, which shows that each LUT cascade consists of 12 cells. Finally, we need a special encoder. Let $v_1$ be the value of the outputs of the top LUT cascade, let $v_2$ be the value of the outputs of the second LUT cascade, let $v_3$ be the value of the outputs of the third LUT cascade, let $v_4$ be the value of the outputs of the fourth LUT cascade, and let $v_{out}$ be the value of the outputs of the special encoder. Then, we have the relation:

$$v_{out} = \begin{cases} v_4 + 1533 & \text{if } v_1 = v_2 = v_3 = 0 \text{ and } v_4 \neq 0, \\ v_3 + 1022 & \text{if } v_1 = v_2 = 0 \text{ and } v_3 \neq 0, \\ v_2 + 511 & \text{if } v_1 = 0 \text{ and } v_2 \neq 0, \\ v_1 & \text{otherwise.} \end{cases}$$

The optimum realization of 9p11g4 is shown in Fig. 6. We also implemented the LPM address generator with n = 32 and k = 2040 on Xilinx Spartan-3 FPGAs (XC3S4000-5). Table 10 shows that 9p11g4 has almost the same throughput and area-delay as 8p11g8, but its area is only 75% of that of 8p11g8. In addition, 9p11g4 has higher throughput/area than 8p11g8. Thus, 9p11g4 is the optimum realization for the LPM address generator with n = 32 and k = 2040.

Table 10. FPGA implementations of the LPM address generator with n = 32 and k = 2040.

Design   Level  Slice  Memory  Fclk    tco    Th.     Area (a)  Th./Area      Delay   Area-Delay
                       (BRAM)  (MHz)   (ns)   (Mbps)  (slice)   (Mbps/slice)  (ns)    (slice-µs)
8p11g8   8      299    64      111.20  26.00  1223    6443      0.190         97.94   631.04
9p11g4   12     241    48      111.33  23.00  1225    4849      0.253         130.79  634.19

(a) We assume that the area for one BRAM is equivalent to the area of 96 slices.
8p11g8 denotes the FPGA implementation with r = 8, p = 11, and g = 8. 9p11g4 denotes the FPGA implementation with r = 9, p = 11, and g = 4.

7 Conclusions

In this paper, we presented the multiple LUT cascade to realize LPM address generators. In addition, we discussed an approach to obtain the optimum configuration of the multiple LUT cascade on FPGAs. Although we illustrated the design method for n = 32 and k = 504 to 511, it can be extended to other values of n and k. We implemented four LPM address generators (i.e. 6p11, 7p11, 8p11, and 9p11) on the Xilinx Spartan-3 FPGA (XC3S4000-5) by using the multiple LUT cascade. For comparison, on the same type of FPGA, we also implemented Xilinx's proprietary TCAM and Reg-Gates, an approach proposed by us as a likely solution to the LPM problem. Xilinx's TCAM has the smallest delay, but requires many slices. Reg-Gates has almost the same delay as Xilinx's TCAM, but requires the largest area, about three times as many slices as Xilinx's TCAM. All multiple LUT cascade realizations have higher throughput, smaller area, and higher throughput/area than Xilinx's TCAM, and are more efficient in terms of area-delay product.
ACKNOWLEDGMENTS

This research is partly supported by a Grant-in-Aid for Scientific Research from JSPS, MEXT, a grant from Kitakyushu Area Innovative Cluster Project, and by an NSA contract.

REFERENCES

K. Pagiamtzis and A. Sheikholeslami, Content-addressable memory (CAM) circuits and architectures: A tutorial and survey, IEEE Journal of Solid-State Circuits, Vol. 41, No. 3, 2006, pp. 712-727.

H. Song and J. W. Lockwood, Efficient packet classification for network intrusion detection using FPGA, Proc. ACM/SIGDA 13th International Symposium on Field Programmable Gate Arrays, 2005, pp. 238-245.

P. C. Wang, C. T. Chan, R. C. Chen, and H. Y. Chang, An efficient ternary CAMs entry-reduction algorithm for IP forwarding engine, IEE Proceedings-Communications, Vol. 152, 2005, pp. 172-176.

S. Kasnavi, V. C. Gaudet, P. Berube, and J. N. Amaral, A novel hardware-based longest prefix matching scheme for TCAMs, Proc. IEEE International Symposium on Circuits and Systems, 2005, pp. 3339-3342.

Renesas Technology Inc., 9M/18M-bit Full Ternary CAM, Datasheet, February 2005.

P. Gupta, S. Lin, and N. McKeown, Routing lookups in hardware at memory access speeds, Proc. IEEE INFOCOM, 1998, pp. 1241-1247.

S. Dharmapurikar, P. Krishnamurthy, and D. Taylor, Longest prefix matching using Bloom filters, Proc. ACM SIGCOMM, 2003, pp. 201-212.

T. Sasao and J. T. Butler, Implementation of multiple-valued CAM functions by LUT cascades, Proc. IEEE International Symposium on Multiple-Valued Logic, May 2006, CD-ROM.

H. Qin, T. Sasao, and J. T. Butler, Implementation of LPM address generators on FPGAs, Proc. International Workshop on Applied Reconfigurable Computing (ARC2006), March 2006, pp. 170-181.

F. Shafai, K. J. Schultz, G. F. R. Gibson, A. G. Bluschke, and D. E. Somppi, Fully parallel 30-MHz, 2.5-Mb CAM, IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, 1998, pp. 1690-1696.

T. Sasao, Analysis and synthesis of weighted-sum functions, IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 25, No. 5, 2006, pp. 789-796.

Website of Xilinx is available at http://www.xilinx.com

Xilinx, Inc., Spartan-3 FPGA family: Complete data sheet, DS099, Aug. 1, 2005.

T. Sproull, G. Brebner, and C. Neely, Mutable codesign for embedded protocol processing, Proc. IEEE 15th International Conference on Field Programmable Logic and Applications, 2005, pp. 51-56.