Mapping Arbitrary Logic Functions into Synchronous Embedded Memories For Area Reduction on FPGAs

Gordon R. Chiu, Deshanand P. Singh, Valavan Manohararajah, and Stephen D. Brown
Toronto Technology Center, Altera Corporation
{gchiu, dsingh, vmanohar, sbrown} at altera.com

ABSTRACT

This work describes a new mapping technique, RAM-MAP, that identifies parts of circuits that can be efficiently mapped into the synchronous embedded memories found on field programmable gate arrays (FPGAs). Previous techniques developed for mapping into asynchronous embedded memories cannot be used because modern FPGAs do not have asynchronous embedded memories. After technology mapping, an area-prediction cost function is used to guide the selection of logic cones to be placed in embedded memories. Extra logic is added to compensate for missing asynchronous functionality on the synchronous memories. Experiments conducted on Altera's Stratix device family indicate that this embedded memory mapping technique can provide an average area reduction of 6.2% and up to 32.5% on a large set of industrial designs. A small architecture change that increases the size of the FPGA fabric by 0.05% can increase the average area reduction to 14.1% and up to 59.1% on the same design set.

1. INTRODUCTION

Designs often have a large amount of timing slack. In these situations, the designer's greatest concern is using the smallest possible device that will fit their circuits, as these devices are generally less costly than larger devices. We present a technique for using unused synchronous memories to implement portions of logic traditionally implemented with LUTs. This, in combination with other area-based techniques, provides the designer with a tool to implement their circuit on the smallest possible programmable device.

Modern FPGAs [1, 2] provide embedded memory blocks (EMBs) to be used as on-chip memories. While there are an increasing number of applications that make use of this on-chip memory, the area devoted to EMBs will be wasted if an application does not require the memory.
A potential solution to this problem is to use the EMBs as a ROM that is capable of implementing a multi-input multi-output logic function. Logic that would traditionally be mapped into logic elements is mapped into unused EMBs instead, thereby increasing the amount of logic that can potentially be packed into the FPGA. In cases where the area savings are significant, it may even be possible to select a smaller device to implement the circuit.

Techniques for mapping combinational logic clusters into embedded memories have been considered in the literature [3, 4, 5, 6]. Methods similar to those used during LUT mapping were used to identify a multi-input multi-output logic cluster which could be placed in an EMB. However, a limitation of these methods is that they cannot be used to map logic into the synchronous embedded memories present in modern FPGAs. A method that identifies sequential logic clusters is needed, and we present such a method in this work. In addition, we describe a technique for handling architectural restrictions of synchronous memories, such as the inability to implement the asynchronous reset/preset behaviour of synchronous logic clusters. In these situations, additional circuitry is added to emulate the functionality expected of an asynchronous reset signal.

ICCAD'06, November 5-9, 2006, San Jose, CA. Copyright 2006 ACM 1-59593-389-1/06/0011...$5.00.

2. MEMORIES AND MEMORY MAPPING

[Figure 1: Asynchronous Memory Timing. Signals shown: Data, Address, Read Enable; intervals marked: Read Address Setup, Read Address Hold.]

Earlier use of switchable synchronous or asynchronous memories in early commercial FPGAs [7] has largely shifted

to fully synchronous memories [1, 2]. There are several notable differences between synchronous memories and their traditional asynchronous counterparts. Most importantly, synchronous memories enforce that all read and write operations are synchronized to a clock edge. Contrast this approach with the asynchronous read operation shown in Figure 1. The system is required to generate a Read Enable pulse for each data read. This signal must meet some strict timing constraints to ensure correct functionality. For example, the address lines must be stable for a certain amount of time before the leading edge of the read enable (Address Setup Time) and remain stable for a certain amount of time after the falling edge of the enable signal (Address Hold Time).

Synchronous memories avoid these complications, as the designer only needs to ensure that their address, data and control signals reach the memory's registered interface before the next active clock edge. The synchronous memory block can then internally generate a read enable strobe that is guaranteed to meet the asynchronous setup and hold constraints. Note that there is still a setup and hold requirement at the registered interface; however, these constraints are guaranteed to be met by the architecture if a global clock is utilized.

The synchronous design style greatly reduces the potential for errors due to misalignments in the timing of the asynchronous signals. In addition, the registered interface greatly reduces the switching activity of signals entering the internal memory circuitry and offers potential for significant power savings. Given these advantages, synchronous memories have become a popular building block in modern FPGA fabrics [1, 2].

2.1 Limitation on Memory Mapping

As modern FPGAs do not have asynchronous embedded memories, previously described techniques in the literature [3, 4, 5, 6] are not applicable. The forced use of synchronous embedded memories further restricts the set of permissible logic cones that can be mapped into memory: each path through the cone of logic must have one and only one register.
It is not apparent how to modify the previously described techniques to enforce selection of cones that meet this constraint. The problem of mapping logic into synchronous embedded memories has not been previously considered in the literature.

3. PRELIMINARIES AND PROBLEM

We use the following definitions for this paper. A circuit is represented by a directed graph G(V, E) where the vertices V represent combinational (4-LUT) or register nodes, and the edges E represent dependencies between the nodes. A node may be combinational or sequential (a register with one input). Given a node v, a cone rooted at v is a sub-network containing v and some of its predecessors. We define a cone rooted at a set of nodes W to be a sub-network containing each node in W along with nodes that are predecessors of at least one node in W.

Given a cone C rooted at a node v, we define the support set of v to be the set of inputs, M, to the largest cone rooted at v which contains no register nodes (other than possibly v). Note that the largest such cone is unique for any given v. A fanout-free cone is a cone in which no node in the cone (except the root) drives a node not in the cone. The maximum fanout-free cone (MFFC) for a node is the fanout-free cone rooted at the node containing the largest number of nodes.

If the delay within circuit components and the delay of connections between circuit components are known, timing analysis can be used to establish the slack [8] of every connection. The slack of a connection is defined to be the amount of delay that can be added to the connection before it becomes critical. A connection is critical if the delay of a path it belongs to exceeds the path-length constraint set by the user. Timing analysis also establishes a slack ratio for each connection. The slack ratio is a value between 0 and 1 which indicates the relative importance of each connection to overall circuit timing. Connections that have a significant effect on circuit timing have slack ratios closer to 0, while connections that have a negligible effect on circuit timing have slack ratios closer to 1.
In the absence of timing constraints, the relationship between slack ratio and slack is given by:

$$\mathit{slack\_ratio}(c) = \frac{\mathit{slack}(c) - \mathit{minslack}}{T_{max}}$$

where minslack refers to the worst-case connection slack in the circuit and $T_{max}$ refers to the maximum delay of any register-to-register path. A precise definition of slack ratios, in the presence of multiple timing constraints, is beyond the scope of this paper. However, from an optimization perspective, slack ratios provide the most accurate criticality information, as the formulation accounts for multi-cycle clocks, inverted clocks and clock skew.

One of the most powerful delay optimization techniques is sequential retiming [9, 10]. This technique moves registers across combinational circuit elements to reduce the length of timing-critical paths. We define implicit retiming to be the sequential retiming implicitly performed when restructuring a cone of logic to be compatible with an EMB, which requires registers on all inputs.

We make the assumption that the maximum number of available EMBs on a chip is fixed. Assuming the chip has N available EMBs, we wish to find a mapping of the circuit into n synchronous EMBs and m logic elements such that $n \le N$ and m is minimized. The width of an EMB is defined as the width (number of bits) of each data word of the EMB. The depth of an EMB is defined as the number of address bits required. Thus an EMB of depth d and width w has $2^d$ words, each word w bits in length.

3.1 Target Architecture Assumptions

We consider two different classifications of FPGA memory architectures for the purposes of this study. Specifically, the two classes differ in the amount of functionality available on the embedded memory block. The embedded memory blocks in the first architecture class, arch_no_clr, have no asynchronous clear functionality on the output of the memory block; most modern commercially available FPGAs [1, 2] fall in this category.
The second architecture class, arch_clr, is identical to the first architecture but adds asynchronous clear functionality to the outputs of the embedded memory blocks in the FPGA fabric. That is, when the asynchronous clear is active, the outputs of the embedded memory block immediately clear to zero or a preset value. This adds a small number of transistors per output of the embedded memory block and is estimated to increase the area of the embedded memory block by approximately 0.05%.

4. THE RAM-MAP TECHNIQUE

The RAM-MAP technique consists of several steps. First, a seed node is selected from the circuit. A cone of nodes is then grown from the seed node using a cost function to achieve greater area reduction. If the cost function indicates that mapping the cone will result in an area decrease, the cone of logic is replaced by an equivalent EMB and, if necessary, asynchronous fix-up logic. The process is then repeated until all possible seed nodes are exhausted or all available EMBs are used.

It is difficult to discuss the selection of sequential logic cones for mapping without an understanding of the method through which the cones are mapped into EMBs. Thus, the stages of the RAM-MAP technique are presented in reverse order. Given a set of nodes X which satisfy certain constraints, section 4.1 presents a method for deriving an EMB Y which is equivalent to the cone X. Section 4.3 describes the cost function used to evaluate a set of nodes. Section 4.4 describes a heuristic for selecting sets of nodes that satisfy the constraints and whose mapping will result in significant area reduction.

4.1 Mapping Sequential Logic

Let X be a cloud of sequential logic. Let M and N be the inputs to and outputs from the cloud. The set N is the set of nodes n such that $n \in X$ and n drives a node not in X. The set M is the set of all nodes m such that $m \notin X$ and m drives a node in X. Note that the above conditions imply the constraint that every node in X must be in the fanout-free cone rooted at the set of nodes N. Thus X is a subset of the maximum fanout-free cone rooted at the set of nodes N. Let R be the set of register nodes within the cone X.
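The definitions of M, N and R can be made concrete with a short sketch. The fan-in dictionary, the node names and the function `cloud_boundary` below are hypothetical illustrations and are not taken from the paper; only the set definitions themselves come from the text above.

```python
from typing import Dict, Set, Tuple


def cloud_boundary(
    fanin: Dict[str, Set[str]], registers: Set[str], X: Set[str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (M, N, R) for a cloud X.

    fanin maps every driven node to the set of nodes that drive it.
    """
    # M: nodes outside X that drive at least one node inside X.
    M = {u for v in X for u in fanin.get(v, set()) if u not in X}
    # N: nodes inside X that drive at least one node outside X.
    N = {
        u
        for v, drivers in fanin.items()
        if v not in X
        for u in drivers
        if u in X
    }
    # R: register nodes that lie inside X.
    R = X & registers
    return M, N, R


# Hypothetical circuit: inputs a, b, c; register r1 feeds back into g1.
fanin = {
    "g1": {"a", "b", "r1"},
    "r1": {"g1"},
    "g2": {"r1", "c"},
    "out": {"g2"},
}
M, N, R = cloud_boundary(fanin, registers={"r1"}, X={"g1", "r1", "g2"})
print(sorted(M), sorted(N), sorted(R))
```

In this toy circuit the cloud has three inputs and a single output node, so a memory implementing it would need at least three address bits before any feedback signals are counted.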
Let us assume each path from each of the inputs (M) to the outputs (N) traverses at least one register, and all registers share the same control signal set (clock, clock enable, synchronous clears, etc). For now, a simplifying assumption is made: no asynchronous clear or reset signals are used.

The cone X can be seen as a multi-input, multi-output state machine whose state is encoded in the set of registers R. If we can implement an equivalent state machine Y using EMBs, and connect those nodes driven by N to Y, the cone of sequential logic X can be entirely removed. Figure 2 is an example showing the restructuring of an arbitrary cone of sequential logic into one compatible for mapping.

Let F be all registers in the cone whose inputs are, directly or indirectly, from another register in the cone. More formally, F is the set of all registers $f \in R$ such that the support set of f, SS(f), contains a register in R; that is, $SS(f) \cap R \neq \emptyset$. We note that we can restructure the cone to ensure each path through the cone traverses exactly one register node. We force the input to each of the registers in F to leave the cone and re-enter the cone. In the example in Figure 2, additional cone inputs and outputs $F_1$ and $F_2$ are created.

We can transform this newly restructured cone into an EMB Y of width w and depth d where the inputs are driven by the original inputs and the new feedback inputs ($M \cup F$) and the outputs are driven by the original outputs and the new feedback outputs ($N \cup F$). Note that each path through Y traverses exactly one register node. The F feedback signals are both outputs of and inputs to Y and represent the state of a state machine encoded into $|F|$ bits.

We note that the register nodes on each path through Y can be implicitly retimed to the inputs of Y. An embedded memory block can be used to implement any Y derived in this fashion, provided the number of bits required does not exceed the capacity of the memory. The requirement that inputs be registered is inherently satisfied. Figure 3 shows the implicit retiming of the restructured cone into an EMB.
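As a rough illustration of the restructuring, the sketch below tabulates ROM contents for a small hypothetical cone and checks by simulation that the EMB form behaves like the original registers. The cone itself (an input register r_in, a 2-bit state register s with next state s + r_in, and an output that is 1 when s equals 3) and all function names are assumptions made for this example; the paper does not give this circuit. Here M = {en}, F = {s}, the address is the registered pair (en, F) and the data word is (output, next F).

```python
from itertools import product

STATE_BITS = 2  # |F|: feedback bits of the hypothetical counter


def next_state(en_r: int, s: int) -> int:
    """Combinational logic in front of the state register s."""
    return (s + en_r) % (1 << STATE_BITS)


def output(s: int) -> int:
    """Combinational logic behind the state register s."""
    return int(s == (1 << STATE_BITS) - 1)


def build_rom() -> dict:
    """Tabulate the data word (out, F_out) for every address (en_r, F_in)."""
    return {
        (en_r, s): (output(s), next_state(en_r, s))
        for en_r, s in product(range(2), range(1 << STATE_BITS))
    }


def run_original(inputs):
    """Simulate the original cone: registers r_in and s clocked together."""
    r_in, s, trace = 0, 0, []
    for en in inputs:
        r_in, s = en, next_state(r_in, s)  # both registers update on the edge
        trace.append(output(s))
    return trace


def run_emb(inputs, rom):
    """Simulate the EMB form: only the address register holds state."""
    addr, trace = (0, 0), []
    for en in inputs:
        _, f_out = rom[addr]        # feedback output before the clock edge
        addr = (en, f_out)          # address register captures M and F
        trace.append(rom[addr][0])  # word read at the newly latched address
    return trace


rom = build_rom()
stimulus = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
print(run_original(stimulus) == run_emb(stimulus, rom))
```

The table has 2^3 = 8 words of 3 bits each (d = 3, w = 3), so under these assumptions the cone would occupy 24 bits of an EMB.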
4.2 Compensating for Asynchronous Resets

Many synchronous circuits use asynchronous reset signals. It is expected that for the majority of sequential logic cones selected for mapping, registers in the cone will have asynchronous reset signals. We assume that each cone of logic X has a maximum of one unique asynchronous clear signal, s.

One major architectural constraint regarding target architecture class arch_no_clr, which includes most modern FPGAs, is due to the lack of asynchronous clear signals on EMBs. When applied to the input register, the asynchronous clear signal immediately clears the input registers. However, the output of the memory does not show the effect of the asynchronous clear until the next rising clock edge. Thus, the asynchronous clear of the EMB cannot implement the required asynchronous reset functionality of the register.

As a result, for this architecture, we need to add additional logic outside the EMB, in the form of additional logic elements, to give the correct behaviour upon asynchronous reset: one register per asynchronous clear signal s and one combinational node per output of the EMB, as seen in Figure 4. Although the register on the asynchronous clear signal can be shared with subsequent mappings, the combinational nodes for each output cannot. This additional logic reduces the expected area gains. Often, this compensation logic has a larger area than the replaced cone, rendering the operation counter-productive. The cost function for area reduction is modified to account for this, as described in 4.3. In addition, the delay increases with the addition of the extra combinational logic.

[Figure 4: Memory with Asynchronous Reset, and Equivalent Implementation. Signals shown: reset, clk, clr, and EMB outputs o1 and o2.]
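A behavioural sketch may help show why the extra register and the per-output nodes are needed on arch_no_clr. The model below is an assumption-laden illustration, not the circuit of Figure 4: it assumes a reset value of zero, a gate register whose data input is tied to 1 and which is cleared by the asynchronous clear, and one AND node per EMB output. The class name `FixedUpEmb` and its methods are hypothetical.

```python
class FixedUpEmb:
    """ROM with a registered address plus assumed reset fix-up logic."""

    def __init__(self, rom: dict, width: int):
        self.rom = rom
        self.width = width
        self.addr = 0  # EMB input (address) register
        self.data = 0  # word produced by the last clocked read
        self.gate = 0  # the one extra register per asynchronous clear signal

    def async_reset(self) -> None:
        # The clear reaches the input register and the gate register at once;
        # the memory word itself only changes at the next clock edge.
        self.addr = 0
        self.gate = 0

    def clock(self, addr: int) -> None:
        # Rising edge: latch the address, perform the read, release the gate.
        self.addr = addr
        self.data = self.rom[self.addr]
        self.gate = 1  # data input of the gate register is constant 1

    def raw_outputs(self) -> list:
        """EMB outputs without fix-up logic (least significant bit first)."""
        return [(self.data >> i) & 1 for i in range(self.width)]

    def outputs(self) -> list:
        """EMB outputs after one combinational AND node per output."""
        return [bit & self.gate for bit in self.raw_outputs()]


emb = FixedUpEmb(rom={0: 0b00, 1: 0b11, 2: 0b10}, width=2)
emb.clock(1)
emb.async_reset()
print(emb.raw_outputs(), emb.outputs())
```

After the reset, the raw memory outputs still hold the stale word, while the gated outputs are already zero; this is the behaviour the compensation logic is meant to restore, at a cost of one logic element per output.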

[Figure 2: Restructuring Sequential Logic Cloud]

[Figure 3: Implicit Retiming into an Embedded Memory Block]

For FPGAs of the architecture class arch_clr, no additional asynchronous fix-up circuitry is required.

4.3 Area Reduction Cost Function

Given a cone of sequential logic X, a cost function is used to guide the growing of the cone. The cost function consists of a weighted sum of two components: the Area Reduction Cost, proportional to the predicted change in the number of logic elements, and the Memory Use Cost, proportional to the number of bits of memory required to implement the cone.

The area reduction cost is given as:

$$c(X) = -k - j + a$$

where k is the predicted number of logic elements removed, j is the predicted reduction in logic elements due to collapsing, and a is the number of logic elements added to correct the asynchronous reset behaviour. If an asynchronous reset is used, the number of logic elements required is a = w, the number of outputs of the new EMB. If no asynchronous reset is used, a = 0.

A logic element can be removed if both its combinational and register nodes can be removed. If each node is either in X or unused, the logic element is predicted to be removed. A combinational-only logic element can be collapsed if it has a fan-out of one and can be merged into the output. It is expected that some added compensation nodes can be collapsed in this manner.

The memory use cost is the size of the EMB in bits, $w \cdot 2^d$, where d and w are defined in section 4.1. A large penalty is assigned if the required size is larger than the size of all available EMBs.

4.4 Cone Growth Heuristic

Our heuristic for growing the cones to be mapped proceeds in two phases. In the first, a set of nodes on the output of register nodes is selected for mapping. In the second, nodes from the input of the register nodes are added to the set and implicitly retimed. We refer to the first stage as selection and to the second as expansion. Figure 5 gives an overview of the cone grow heuristic, and Figure 6 shows an example cone selection and expansion.
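The two cost components of Section 4.3 can be sketched as follows. The weights, the penalty value and the function names are hypothetical, since the paper does not state them; the reading that the penalty applies when the cone fits in no available EMB is also an assumption.

```python
from typing import Sequence


def area_reduction_cost(k: int, j: int, num_outputs: int, has_async_reset: bool) -> int:
    """Negative values predict a net saving in logic elements."""
    # One compensation logic element per EMB output, only if a reset is used.
    a = num_outputs if has_async_reset else 0
    return -k - j + a


def memory_use_cost(w: int, d: int) -> int:
    """Size in bits of an EMB with 2**d words of w bits."""
    return w * 2 ** d


def cone_cost(
    k: int,
    j: int,
    w: int,
    d: int,
    has_async_reset: bool,
    emb_capacities: Sequence[int],
    area_weight: float = 1.0,      # hypothetical weight
    memory_weight: float = 0.001,  # hypothetical weight
    penalty: float = 1e9,          # hypothetical "large penalty"
) -> float:
    bits = memory_use_cost(w, d)
    cost = area_weight * area_reduction_cost(k, j, w, has_async_reset)
    cost += memory_weight * bits
    if bits > max(emb_capacities):  # assumed: cone fits in no available EMB
        cost += penalty
    return cost


# Hypothetical cone: 12 logic elements removed, 1 collapsed, 4 outputs, 7 inputs.
print(cone_cost(12, 1, 4, 7, True, emb_capacities=[512, 4096]))
```

With these example numbers the area term is -12 - 1 + 4 = -9, so the mapping would be accepted; the same cone with 13 inputs would need 32768 bits and would be rejected by the penalty.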
4.4.1 Node Set Selection

At the beginning of each iteration of the heuristic, a seed node c is selected from all nodes in the circuit that have not participated in a mapping. A simple greedy heuristic is then used to grow the cone from the seed node.

The set of all registers compatible with each register in the support set of c (those that share all control signals) is determined. The live set, LiveSet(c), is the union of the compatible register set with all nodes whose support set is a subset of the compatible register set. Thus LiveSet(c) is the set of possible nodes to add which still satisfy the register compatibility constraint. If the live set is empty, the iteration of mapping fails and is repeated with a new seed node.

From the live set, nodes are greedily selected and added to the set of nodes to be mapped. The process of adding a node to the set may cause multiple nodes to be added. The input nodes are recursively added up to and including the support set. This ensures that the cone to be mapped remains connected. Each candidate node in turn is test-added to the set, expanded through the node set expansion heuristic, and then evaluated by the cost function. The candidate node resulting in the lowest cost is added to the set. The iteration is terminated when adding each node results in a higher cost.

Figures 6(a) and 6(b) show an example of the selection process. A candidate node, and all nodes up to its support set, are added to the set.

4.4.2 Node Set Expansion

Given a set X to be mapped (with register nodes R), during the expansion phase we add nodes from the inputs of R into our mapping set. When mapping is performed, the registers are implicitly retimed across these added nodes. Only nodes that are in the fanout-free cone rooted at the register nodes R and not in the LiveSet are eligible for inclusion.

At the time of node-set expansion, w, the number of outputs of the set X, is known. We can calculate the maximum number of inputs d such

that the resulting d-input, w-output function will fit into the largest available EMB. The problem is similar to finding the maximum-volume d-feasible cut of the maximum fanout-free cone rooted at the registers R. An algorithm for finding this cut was presented in [11], but would not be appropriate due to our specialized cost function.

For our implementation, we employ a simple greedy heuristic to perform the node set expansion. We test-add each node to the set, and evaluate it by the cost function. The node resulting in the lowest cost is added to the set. The process is repeated until a local minimum is reached.

Figures 6(b) and 6(c) show an example of the expansion process. Nodes on the inputs of the registers are added to the cone to be mapped.

[Figure 6: Example Cone Selection and Expansion. Panels (a), (b) and (c); labels: Candidate Node, Existing Cone, New Cone.]

    C ← Circuit
    for c ∈ C
        if LiveSet(c) ≠ { }
            eSet ← InputAndSupport(c)
            set ← Select(eSet, LiveSet(c))
            (mapSet, mapCost) ← Expand(set)
            if mapCost < 0
                PerformMapping(mapSet)
            end if
        end if
    end for

    function Select(X, L)
        bestSet ← X
        (set, bestCost) ← Expand(X)
        for x ∈ L
            X' ← X ∪ {x} ∪ InputAndSupport(x)
            (set, cost) ← Expand(X')
            if cost < bestCost
                bestSet ← X'
                bestCost ← cost
            end if
        end for
        bestSet ← Select(bestSet, L \ bestSet)
        return bestSet
    end function

    function Expand(X)
        bestSet ← X
        bestCost ← Cost(X)
        for x ∈ MFFC(X) ∩ Inputs(X)
            X' ← X ∪ {x}
            if Cost(X') < Cost(X)
                bestSet ← X'
                bestCost ← Cost(X')
            end if
        end for
        (bestSet, bestCost) ← Expand(bestSet)
        return (bestSet, bestCost)
    end function

Figure 5: An overview of the Cone Grow Heuristic.

4.4.3 Performance Considerations

EMBs are considerably slower than combinational lookup tables, so it is expected that, without modification, the RAM-MAP technique will significantly reduce the maximum frequency of operation of the circuit. The technique can be modified to prevent the selection of critical combinational nodes. We first perform a timing analysis step using a statistical delay model described in [12].
The expected slack (ES) of a cone of logic after mapping to memory can then be estimated using the minimum expected slack of all outputs:

$$ES = \min_{o \in \mathit{outputs}} \left( \mathit{slack}_o + \mathit{LUTDelay}_o - \mathit{memorydelay} \right)$$

where $\mathit{slack}_o$ is the slack at an output of the cone, $\mathit{LUTDelay}_o$ is the delay of the shortest path from the output to a register in the cone, and memorydelay is the expected combinational delay of the EMB and asynchronous fix-up logic.

The expected slack ratio (ESR) is then calculated from the expected slack, and the concept of a slack ratio threshold (SRT) is employed. The SRT defines a threshold below which the expected slack ratio should not fall. If the expected slack ratio is below the threshold, the operation is deemed to significantly and adversely affect timing and is not performed. If the expected slack ratio remains above the threshold, the operation is performed as normal. Thus, when selecting nodes, a candidate node which causes the cone to have an ESR < SRT is rejected.

Due to the implicit retiming, the combinational logic delay on the input of the EMB cannot increase. Thus any node is acceptable for inclusion during the expansion process.

4.5 Implementation Efficiency

The implementation of our algorithm includes several optimizations that do not reduce the worst-case runtime, but still significantly speed up the technique.

Dynamic programming is utilized in the expansion phase of the technique. A solution cache is indexed on two characteristics of the cone: the set of registers of the cone X to be expanded as well as the allowable increase in number of

inputs (calculated from the maximum memory size, and the current number of outputs). Two cones sharing these characteristics can use the same solution from the cache.

Branch pruning is frequently employed when performing the greedy cone selection and expansion heuristics. One example is the removal of nodes from the LiveSet. When the node resulting in the largest area reduction is added to the cone, the nodes consistently resulting in large area increases are removed from future consideration for inclusion with this cone. This pruning does not significantly affect the quality of the final solution.

5. EXPERIMENTAL RESULTS

Altera's Stratix [13] chips were used as the target for the logic mapping experiment. The chip is comprised of I/O elements (IOEs), logic array blocks (LABs), digital signal processing blocks (DSPs) and embedded memory blocks (M512, M4K and M-RAM). A LAB in a Stratix device contains 10 logic elements (LEs). The Stratix LE contains a four-input lookup table (4-LUT), a register and some logic that facilitates the creation of arithmetic circuits.

The chip is composed of three types of embedded memory blocks: the 512-bit M512, the 4096-bit M4K, and the 512-Kbit M-RAM block. Each of these blocks is synchronous, requiring registered address, data, and control signal inputs, and optionally registered outputs. Additionally, the M-RAM does not support memory initialization and cannot be used for logic mapping. The chip does not have an asynchronous clear available on the embedded memory blocks, and classifies into the arch_no_clr architecture class.

Each type of embedded memory block can be used in multiple width and depth configurations. For example, the 512-bit M512 can be used in a 512-address by 1-bit word configuration (9 address inputs and 1 data output), a 256 x 2 configuration (8 inputs and 2 outputs), and others up to and including 32 x 16 (5 inputs and 16 outputs).

5.1 Results for the arch_no_clr Architecture

In our experiments, all steps of the FPGA CAD flow were performed by a modified version of Quartus II v5.0.
After technology mapping but prior to placement, the flow was modified to perform the RAM-MAP EMB mapping technique. We study the benefits of applying the EMB mapping technique on 87 industrial circuits. Quartus was run twice, first with RAM-MAP turned off, and then with it turned on. For each circuit, the device chosen was the smallest Stratix-family device that could fit the circuit (with RAM-MAP off). The number of logic elements observed at the end of each run is used to compute the area reduction observed as a result of applying the technique. Note that as these circuits are industrial, a number of them already utilize the embedded memories. For some of the circuits, very few memories are available for use by RAM-MAP.

Figure 7 presents the area reduction observed for each circuit as a result of applying the RAM-MAP technique with a slack ratio threshold at which all operations are accepted. A mean area reduction of 6.2% was observed. No circuits were observed to increase in area as a result of applying the technique, because the cost function is able to predict the resulting area with perfect accuracy (and rejects area-increasing mappings). On average, the area reduction is 4.59 logic elements per used M4K and 2.28 logic elements per used M512.

5.2 Results for the arch_clr Architecture

It is clear that the need to compensate for asynchronous clears diminishes the area reduction available from RAM-MAP when using the arch_no_clr architecture class, which includes Altera's Stratix device. If an asynchronous clear were available on the EMB, it is expected that the area reduction should increase.

We can quantify this prediction by repeating the experiment, but assuming the Stratix architecture is modified (as per section 3.1) to be of class arch_clr. The flow is identical, except we do not create asynchronous clear compensation logic elements and adjust our cost function accordingly.

Figure 7 presents the area reduction observed for each circuit as a result of applying the RAM-MAP technique, assuming an architecture of class arch_clr, with an SRT at which all operations are accepted. A mean area reduction of 14.1% was observed.
The area reduction and performance degradation are higher due to logic cones whose mapping was previously undesirable (due to the asynchronous clear cost) now being mapped.

5.3 Impact on Performance

The primary goal of the EMB mapping technique is to decrease area. EMBs are much slower than conventional combinational lookup tables, and the constrained physical location of EMBs on the chip introduces additional placement and routing constraints. It is therefore expected that mapping logic into EMBs will reduce the performance of the circuit.

In obtaining the area reduction indicated above for the arch_no_clr architecture, with a slack ratio threshold at which all operations are accepted, the technique decreases the maximum frequency of operation of the circuit by 18.2% (mean). This performance degradation is large, and thus this technique is not applicable to all designs. However, for those designs with large amounts of timing slack, the designer's greatest concern is using the smallest possible device that will fit their circuit, as this device is generally less costly than larger devices. This technique may help the designer utilize a smaller device than otherwise possible.

With the addition of the arch_clr architecture modification, the reduction in the maximum frequency of operation is 34.5% (mean). This increase in reduction is due to the increased number of mappings that take place. Note that these performance degradations are measured after placement, routing and final signoff timing analysis.

The slack ratio threshold can be appropriately chosen to reduce the performance penalty at the cost of less area reduction. Figure 8 shows the performance versus area trade-off for 7 values of the slack ratio threshold. Each point represents the average area reduction versus the average performance reduction of the entire benchmark of 87 circuits, run with a different slack ratio threshold parameter. Because the RAM-MAP technique is performed prior to placement and it is difficult to predict the post-placement delay [14], it is very difficult to both predict and control the performance reduction.
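The slack ratio threshold filter of Section 4.4.3 can be sketched in a few lines. The paper states only that the expected slack ratio is calculated from the expected slack; the sketch assumes it is obtained by applying the slack ratio formula of Section 3 to the expected slack. All function names and numeric values are hypothetical.

```python
from typing import Iterable, Tuple


def slack_ratio(slack: float, min_slack: float, t_max: float) -> float:
    """Slack ratio in the absence of timing constraints (Section 3)."""
    return (slack - min_slack) / t_max


def expected_slack(outputs: Iterable[Tuple[float, float]], memory_delay: float) -> float:
    """Minimum over cone outputs of slack_o + LUTDelay_o - memorydelay.

    outputs holds one (slack_o, lut_delay_o) pair per cone output.
    """
    return min(slack_o + lut_o - memory_delay for slack_o, lut_o in outputs)


def accept_mapping(
    outputs: Iterable[Tuple[float, float]],
    memory_delay: float,
    min_slack: float,
    t_max: float,
    srt: float,
) -> bool:
    es = expected_slack(outputs, memory_delay)
    esr = slack_ratio(es, min_slack, t_max)  # assumed way of deriving the ESR
    return esr >= srt                        # a cone with ESR < SRT is rejected


# Hypothetical cone with two outputs; delays in nanoseconds.
cone_outputs = [(4.0, 1.0), (6.0, 0.5)]
print(accept_mapping(cone_outputs, 3.0, min_slack=0.0, t_max=10.0, srt=0.125))
print(accept_mapping(cone_outputs, 3.0, min_slack=0.0, t_max=10.0, srt=0.25))
```

In this example the expected slack is 2.0 and the expected slack ratio is 0.2, so the mapping passes a threshold of 0.125 but is rejected at 0.25. Raising the threshold therefore trades area reduction for performance, which is the trade-off plotted in Figure 8.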

[Figure 7: Area Reduction on 87 Industrial Circuits. Vertical axis: Area Reduction (percent of original circuit), 0% to 60%; horizontal axis: Circuits; series: RAM-MAP for arch_clr and RAM-MAP for arch_no_clr.]

[Figure 8: Performance vs. Area Trade-off. Vertical axis: Performance (percent change), 0% to -20%; horizontal axis: Area Reduction (percent of original circuit), 0% to 7%; points labelled by SRT value, including 0.625, 0.500, 0.375, 0.250, 0.125 and 0.000.]

6. ADVANCED TECHNIQUES

This section presents two techniques for further increasing area reduction from the RAM-MAP technique.

6.1 Negative-Edge Clocked Memory

One of the limitations of the RAM-MAP technique described above is that all cones to be mapped must contain registers. Every path from input to output in the cone must traverse at least one register, because every EMB requires registered inputs. If purely asynchronous EMBs were available, it would be possible to map a cone of combinational logic between registers into a memory.

It is occasionally possible to implement an asynchronous EMB using a synchronous EMB [15]. This is accomplished by clocking the synchronous EMB with an inverted clock. Stringent conditions must hold in order to be able to perform this mapping. First, all registers reachable by traversing the fan-in or fan-out network of the mapped cone must have compatible control signals. Second, two EMBs used in this manner cannot fan in or fan out to each other.

It is important to note that fix-up logic still needs to be added if the registers on the fan-in of the mapped cone have asynchronous clears. This is because the results of an asynchronous clear activated after the falling edge of the clock will not propagate through the memory to the inputs of the next level of registers.

Figure 9 shows an example of mapping a cone of logic into a negative-edge clocked memory element. A brief description of the algorithm follows. First, all sets of registers with compatible control sets are identified, and each set is given a unique identifier. In the example figure, the set identifier is noted on the register.
Second, the fan-in and fan-out networks of each register are recursively traversed and annotated with the identifier for the register, as seen in Figure 9(a). All combinational nodes with only one annotation are bounded by compatible registers and are mappable. Next, regions of connected mappable nodes are identified, as seen in Figure 9(b). A subset of each region is selected for mapping by a cost function. In the example, the entire region is selected for mapping; this may be neither desirable (due to asynchronous fix-up logic) nor feasible (due to the maximum size of EMBs). Finally, the selected cone of logic is replaced by an equivalent synchronous EMB, with an inverted clock signal, as seen in Figure 9(c).

As a proof of concept, an implementation of this technique with a simple greedy heuristic to select cones can achieve a reduction in area of 1.1% (up to 7.2%) on top of any reductions realized through the RAM-MAP technique reported in Section 5. A better algorithm for increased area reduction is an area of future research.

6.2 State Machine Re-Encoding

The mapping method we described identifies a cone of sequential logic which is turned into a finite state machine and then placed in an EMB. Standard textbook methods of state machine reduction [16] can be used to reduce the number of states, thereby reducing both the EMB size and any logic needed to emulate an asynchronous reset.

Area reduction can be realized if the number of bits required to encode the state machine is reduced. Since the signals carrying the state machine encoding appear at both the inputs and outputs of an EMB, both the number of RAM inputs and outputs can be reduced, reducing the need for asynchronous fix-up logic.

An implementation of this technique can achieve a reduction in area of up to 1.2%, primarily on circuits where feedback structures are common. On average across the circuit set, a reduction of 0.1% is realized. These results

are on top of those reported in Section 5.

[Figure 9: Mapping into Negative-Edge Clocked Memory Elements. Panels (a), (b) and (c).]

7. CONCLUSION

In this paper, we describe a new mapping technique. The RAM-MAP technique maps combinational and sequential logic into unused embedded memory blocks to reduce the number of logic elements required to implement the circuit. The technique is also able to satisfy two constraints of the target architecture's embedded memory blocks. Extra logic is added to compensate for the lack of asynchronous clear on EMBs. Special considerations are made when selecting and manipulating cones to ensure cones can be mapped into the input-registered, synchronous EMBs.

On a set of 87 industrial circuits, the RAM-MAP technique is able to reduce on average by 6.2% and up to 34.4% the number of logic elements required to implement the circuit on our target architecture, Stratix. With a small change to the architecture that increases overall FPGA size by 0.05%, the potential area reduction is increased on average to 14.1% and up to 59.1%.

8. REFERENCES

[1] Altera Corporation, Altera Product Catalog, May 2005.
[2] Xilinx Corporation, Virtex Series FPGAs Product Matrix, May 2005.
[3] S. J. E. Wilton, "Smap: Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays," Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 1998.
[4] S. J. E. Wilton, "Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, pp. 56-68, January 2000.
[5] J. Cong and S. Xu, "Technology mapping for FPGAs with embedded memory blocks," Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 179-188, February 1998.
[6] M. Kumar, J. Bobba, V. Kamakoti, "MemMap: Technology Mapping Algorithm for Area Reduction in FPGAs with Embedded Memory Arrays Using Reconvergence Analysis," Design, Automation and Test in Europe Conference and Exhibition Volume II (DATE'04), pp. 922-929, 2004.
[7] F.
Heile and A. Leaver, "Hybrid product term and LUT based architectures using embedded memory blocks," International Symposium on Field Programmable Gate Arrays (FPGA), 1999.
[8] R. Hitchcock, G. Smith, and D. Cheng, "Timing analysis of computer-hardware," IBM Journal of Research and Development, pp. 100-105, January 1983.
[9] C. Leiserson, F. Rose, and J. Saxe, "Optimizing Synchronous Circuitry," 1983.
[10] C. Leiserson and J. Saxe, "Retiming Synchronous Circuitry," 1991.
[11] J. Cong and Y. Ding, "Flowmap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, pp. 1-12, January 1994.
[12] D. Singh, V. Manohararajah, and S. Brown, "Two-stage physical synthesis for FPGAs," Custom Integrated Circuits Conference (CICC), September 2005. To appear.
[13] Altera Corporation, Stratix Device Handbook (Complete Two-Volume Set), July 2005.
[14] V. Manohararajah, G. Chiu, D. Singh, and S. Brown, "Difficulty of Predicting Interconnect Delay in a Timing Driven FPGA CAD Flow," Proceedings of the 2006 International Workshop on System Level Interconnect Prediction, pp. 3-8, March 2006.
[15] Altera Corporation, Application Note 210: Converting Memory from Asynchronous to Synchronous for Stratix and Stratix GX Designs, November 2002.
[16] Z. Kohavi, Switching and Finite Automata Theory. McGraw-Hill Publishing Company, 2nd ed., 1978.