L-CBF: A Low-Power, Fast Counting Bloom Filter Architecture

Elham Safi, Andreas Moshovos, and Andreas Veneris
Electrical and Computer Engineering Department
University of Toronto
{elham, moshovos, veneris}@eecg.utoronto.ca

ABSTRACT
We study the energy, latency and area characteristics of two Counting Bloom Filter implementations using full-custom layouts in a commercial 0.13 µm technology. The first implementation, S-CBF, uses an SRAM array of counts and a shared counter. The second, L-CBF, utilizes an array of up/down linear feedback shift registers. Circuit-level simulations demonstrate that for a 1K-entry CBF with a 15-bit count per entry, L-CBF is 3.7 or 1.6 times faster than the S-CBF depending on the operation. The L-CBF requires 2.3 or 1.4 times less energy per operation compared to the S-CBF. However, the L-CBF requires 3.2 times more area. We demonstrate that for one application of CBFs (early hit/miss detection for L1 caches [12] for an aggressive dynamically-scheduled superscalar processor) the energy consumed by the L-CBF is 60% of the energy consumed by the S-CBF for most of the SPEC CPU 2000 benchmarks.

Categories and Subject Descriptors
B.7. Integrated Circuits, C. Processor Architecture.

General Terms
Design, Measurement, Performance, Experimentation.

Keywords
Counting Bloom Filters, Processors, Delay, Energy per Operation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISLPED'06, October 4-6, 2006, Tegernsee, Germany.
Copyright 2006 ACM 1-59593-462-6/06/0010...$5.00.

1. INTRODUCTION
An increasing number of architectural techniques rely on hardware counting bloom filters (CBFs) to improve upon the power, latency and complexity of various key processor structures. For example, CBFs have been proposed to improve performance and power in snoop-coherent multiprocessor or multicore systems [8,9], to improve the scalability of load/store scheduling queues [10] and to reduce replays by assisting in early hit/miss determination at the L1 data cache [12]. In these proposals CBFs eliminate broadcasts over the interconnection network in multiprocessor systems [8], or accesses to much larger, and thus much slower and power-hungry, content addressable memories [10] or cache tag arrays [8,9,12].

In all aforementioned hardware applications, the CBF improves the energy and latency of membership tests (e.g., whether a memory block is currently cached). It does so by providing a definite answer for most but not all tests. Thus, the CBF does not replace the underlying conventional mechanism (e.g., cache tags). Instead, the CBF dynamically bypasses the conventional mechanism as frequently as possible. Accordingly, the benefits obtained through the use of a CBF depend on how frequently it can be utilized and on the CBF's energy and latency characteristics. The more tests are serviced by the CBF alone and the lower the power and latency of the CBF, the higher the benefits. Architectural techniques and application behavior determine how many tests can be serviced by the CBF. In this work we are only concerned with implementations of CBFs that improve on energy and latency.

Conceptually, a CBF is an array of counts for which three operations are defined: increment by one, decrement by one, and test if zero. We will refer to the first two operations as updates and to the third as a probe. Previous work assumed a straightforward SRAM-based implementation which we will refer to as S-CBF (see Section 2.1). In this work we investigate the energy, latency and area of this implementation in a commercial 0.13 µm CMOS technology.

However, the key contribution of this work is L-CBF, a novel implementation of CBFs that relies on up/down linear feedback shift registers (LFSRs). We demonstrate that this implementation is significantly faster, and that it requires significantly less energy than the S-CBF implementation. Using architecture-level simulation of most of the SPEC CPU 2000 programs we demonstrate that L-CBF can significantly reduce power for the early detection of L1 data cache misses [12].

In more detail, the contributions of this work are as follows:
- L-CBF, an LFSR-based counting bloom filter architecture, is proposed.
- The energy, latency and area of L-CBF and S-CBF are compared using their circuit-level implementation and full-custom layouts in 0.13 µm fabrication technology.
- The relative energy dissipation of L-CBF and S-CBF is compared for the early detection of L1 data cache misses for most SPEC CPU 2000 programs [12].

To the best of our knowledge this is the first work that investigates the energy, latency and area of full-custom implementations of CBFs using a commercial fabrication technology. The idea of using LFSRs for the design of CBFs has been proposed before, but no design or evaluation of its characteristics was reported [8].

The rest of this paper is organized as follows. Section 2 reviews CBFs and the previously assumed S-CBF implementation. Section 2.2 presents the L-CBF design. Section 3 demonstrates the experimental results. Section 4 summarizes our findings.

2. COUNTING BLOOM FILTERS
Without loss of generality, we restrict our attention to utilizing CBFs for the early detection of L1 data cache misses [12]. The concepts and implementations presented are directly applicable to other CBF applications. In this application, the CBF determines whether a particular block of memory is currently cached in the L1 data cache. Given a block address A, the CBF reports whether A appears in any of the tags of the data cache. The CBF provides two possible answers: (1) "no", the block is definitely not cached, and (2) "unknown", the block may be cached.

In the first case, we determine that A is not cached and hence this access will result in a miss. Provided that the CBF is much faster and dissipates much less power than the L1 tag arrays, we manage to obtain the desired answer much faster and to save power. In the second case, the CBF cannot provide a definite answer and thus we do have to access the L1 tags. In this case, we incur a power penalty since we had to also access the CBF. We may also incur a latency penalty if the CBF and the L1 tag accesses are serialized. We may avoid this latency penalty if we probe the CBF and the L1 tags in parallel, in which case power benefits will be possible only if we can terminate the L1 tag access in progress when the CBF provides a definite answer.
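The two possible answers and the serialized access order described above can be sketched in software. This is only an illustrative model, not the hardware interface: the names `AccessStats`, `lookup`, `cbf_is_zero` and `tags_contain`, as well as the toy 8-entry filter, are assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class AccessStats:
    cbf_probes: int = 0    # every access probes the filter first
    tag_lookups: int = 0   # accesses the filter could not rule out
    early_misses: int = 0  # misses resolved without touching the tags


def lookup(block_addr, cbf_is_zero, tags_contain, stats):
    """Serialized access: ask the CBF first, fall back to the L1 tags."""
    stats.cbf_probes += 1
    if cbf_is_zero(block_addr):
        # Answer (1): definitely not cached, so the tag access is skipped.
        stats.early_misses += 1
        return False
    # Answer (2): unknown, so the conventional mechanism has to decide.
    stats.tag_lookups += 1
    return tags_contain(block_addr)


# Hypothetical stand-ins for the filter and the tag array (toy sizes).
cached_blocks = {0x10, 0x20}
nonzero_entries = {b % 8 for b in cached_blocks}  # 8-entry toy filter


def cbf_is_zero(block_addr):
    return block_addr % 8 not in nonzero_entries


def tags_contain(block_addr):
    return block_addr in cached_blocks


stats = AccessStats()
results = [lookup(a, cbf_is_zero, tags_contain, stats)
           for a in (0x10, 0x13, 0x18)]
```

In this toy run, 0x13 is resolved by the filter alone, while 0x18 aliases with a cached block and still needs a tag lookup. That second case is the one where the CBF probe is pure overhead.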
As shown in Figure 1, the CBF can be thought of as an array of counts that is indexed via a hash function of the address A, and where three operations are defined: (1) increment count, (2) decrement count, and (3) test if count is zero, or probe. The first two operations increment or decrement the corresponding count by one, while the probe operation tests if the count is zero and returns true or false (single-bit output). Simply using a portion of the address and not a more elaborate hash function has been shown to work well [8,12].

Figure 2: S-CBF architecture: An SRAM holds the CBF counts and updates are implemented as read-modify-write sequences.

Initially, all CBF counts are zero and the L1 is empty. When a block is allocated into the L1, the corresponding CBF entry is incremented by one. When a block is evicted from the L1, the corresponding CBF entry is decremented by one. To test whether A currently exists in the L1, we inspect the corresponding CBF count. If the count is zero then A is definitely not in the L1, since we would have incremented the count the moment it was cached. If the count is non-zero then it is unknown whether A is cached. Since many blocks can map onto the same CBF count, it is possible that some other cache block incremented the count. Therefore, in this case we need to check the L1 tags to determine whether A is cached. It is for the latter reason that a CBF is an imprecise representation of the cached blocks. Specifically, the CBF represents a superset of the cached blocks.

A CBF is characterized by the number of entries it contains and the width of the count of each entry. Multiple CBFs with different hash functions can be used to improve accuracy [9,12]. In addition, count values are bounded. Since the same count entry is incremented and decremented on a block's allocation and eviction respectively, a count can never become negative and can never exceed the total number of cache blocks.
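The allocate/evict/probe behavior just described can be modeled in a few lines. The sketch below is a behavioral model only and says nothing about circuit structure; the class name and the default parameters (nine index bits taken right above a 6-bit block offset) are assumptions chosen for illustration.

```python
class CountingBloomFilter:
    """Behavioral model: an array of counts indexed by a slice of the address."""

    def __init__(self, index_bits=9, block_offset_bits=6):
        self.index_bits = index_bits
        self.block_offset_bits = block_offset_bits
        self.counts = [0] * (1 << index_bits)

    def index(self, addr):
        # "Hash" = the address bits immediately above the block offset.
        return (addr >> self.block_offset_bits) & ((1 << self.index_bits) - 1)

    def increment(self, addr):
        """Call when the block containing addr is allocated in the cache."""
        self.counts[self.index(addr)] += 1

    def decrement(self, addr):
        """Call when the block containing addr is evicted from the cache."""
        i = self.index(addr)
        if self.counts[i] == 0:
            raise ValueError("eviction without a matching allocation")
        self.counts[i] -= 1

    def probe(self, addr):
        """True means the count is zero: the block is definitely not cached."""
        return self.counts[self.index(addr)] == 0


cbf = CountingBloomFilter()
a = 0x1000
alias = a + (1 << 15)      # differs only above the index bits: same entry as a
other = a + 0x40           # the next cache block: a different entry

cbf.increment(a)           # block a is allocated
after_alloc = (cbf.probe(a), cbf.probe(alias), cbf.probe(other))
cbf.decrement(a)           # block a is evicted
after_evict = cbf.probe(a)
```

After the allocation, the probe for `alias` also reports non-zero even though that block was never cached, which is the superset property. The probe for `other` still reports zero.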
2.1 S-CBF: SRAM-Based CBF
Previous work assumed a straightforward CBF implementation comprising an SRAM array to hold the counts, a shared up/down counter, a zero-comparator and a small controller [9]. This implementation is shown in Figure 2. Updates are implemented as read-modify-write sequences as follows: (1) the count is read from the SRAM, (2) it is adjusted using the counter, and (3) it is written back to the SRAM. The probe operation is implemented as a read from the SRAM followed by a comparison with zero using the zero-comparator. A small controller coordinates all actions necessary to implement these CBF operations.

Figure 1: A CBF for cache block address membership test. Ideally, a separate entry would exist for every possible block address A. However, this would result in a prohibitively large table (e.g., a table with 32 million entries for a processor with a 4Gbyte address space and 32-byte cache blocks) and would negate any benefits. Accordingly, a small table is used and addresses are hashed onto the table. Hence multiple addresses may map onto the same table entry.

An optimization was proposed to speed up probe operations and to reduce their power [9]. Specifically, an extra bit, Z, is added to each count. When the count is incremented from zero, Z is set to false, and when the count is decremented to zero, Z is set to true. Probes can now simply inspect Z. The Z-bits can be implemented as a separate SRAM structure which is faster and requires much less power. This optimization can be applied to both the S-CBF and the L-CBF.

2.2 L-CBF
As we demonstrate in Section 3, much of the energy in S-CBF is consumed on the bitlines and wordlines. Latency and energy also suffer because updates require two SRAM accesses per operation. Finally, the shared counter further increases energy and latency. We could avoid accesses over long bitlines by building an array of up/down counters. That would eliminate the need to read the value of each counter, and updates would be localized. Unfortunately, up/down arithmetic counters require many gates and are slow [11].

We make the following two observations: (1) the actual count sequence used in a CBF is not important, and (2) externally, we only care whether a count is zero or non-zero. The L-CBF exploits these two properties to offer the benefits that are possible with an array of up/down counters while avoiding the overheads associated with using arithmetic counters. L-CBF uses up/down LFSRs, which offer a better latency, power and complexity trade-off than other non-arithmetic counters.

As we demonstrate in Section 3, L-CBF significantly reduces energy and latency compared to S-CBF, albeit at the expense of increased area. However, this is a minor concern in modern processor designs for two reasons: (1) there is an abundance of resources, and (2) the CBF is tiny compared to most other processor structures (e.g., caches and branch predictors). It is unlikely that the same resources could improve performance if applied to other processor structures that are already much larger and optimized.
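As a point of comparison for the L-CBF, the S-CBF update path of Section 2.1 and its Z-bit optimization can be sketched as a behavioral model that counts array accesses. The class and counter names are assumptions made for this example, and the model deliberately ignores timing, partitioning and the energy of the shared counter.

```python
class SramCbfModel:
    """Behavioral S-CBF: counts in one array, Z-bits in a separate small array."""

    def __init__(self, entries):
        self.counts = [0] * entries
        self.z = [True] * entries   # Z is true exactly when the count is zero
        self.count_reads = 0        # accesses to the main count array
        self.count_writes = 0
        self.z_reads = 0            # accesses to the Z-bit array

    def _update(self, i, delta):
        old = self.counts[i]        # (1) read the count from the SRAM
        self.count_reads += 1
        new = old + delta           # (2) adjust it in the shared counter
        if new < 0:
            raise ValueError("count would become negative")
        self.counts[i] = new        # (3) write it back to the SRAM
        self.count_writes += 1
        if old == 0 and new > 0:
            self.z[i] = False       # incremented from zero
        elif old > 0 and new == 0:
            self.z[i] = True        # decremented to zero

    def increment(self, i):
        self._update(i, +1)

    def decrement(self, i):
        self._update(i, -1)

    def probe(self, i):
        self.z_reads += 1           # probes inspect only the Z-bit
        return self.z[i]


cbf = SramCbfModel(entries=16)
cbf.increment(3)
cbf.increment(3)
probe_nonzero = cbf.probe(3)
cbf.decrement(3)
cbf.decrement(3)
probe_zero = cbf.probe(3)
```

Each of the four updates in this run costs one read and one write of the main array, which is the two-accesses-per-update overhead that motivates the L-CBF. The two probes touch only the Z-bit array.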
In the rest of this section, we review LFSRs, the construction of up/down or reversible LFSR counters, and present the organization of L-CBF.

2.2.1 Linear Feedback Shift Registers
In this section, we review LFSRs and the construction of up/down LFSR counters. A maximum-length n-bit LFSR counter sequences through $2^n - 1$ states. Without loss of generality we restrict our attention to the Galois configuration of LFSRs [1]. Figure 3 shows a maximum-length 8-bit LFSR. The LFSR comprises a shift register and a few XNOR gates. Each bit of the shift register is either shifted as-is to the next bit (no tap) or is XNORed with the output of bit 7 (tap). By appropriately selecting the tap locations it is always possible to build a maximum-length LFSR of any width that has either two or four taps [1,5]. Furthermore, ignoring wire length delays and the fan-out of the feedback line, the delay of the maximum-length LFSR is independent of its size [1]. As we show in Section 3.2, latency increases only slightly as the number of bits increases, primarily as a result of increased capacitance on the control lines.

Figure 3: An 8-bit maximum-length LFSR.

The tap locations for a maximum-length n-bit LFSR can be represented as a primitive polynomial g(x). Figure 4 shows an example of such a polynomial. In general, an LFSR can be expressed as:

$$g(x) = \sum_{i=0}^{n} C_i X^i \qquad (C_n = C_0 = 1)$$

where $X^i$ corresponds to the output of the i-th bit of the shift register and the constants $C_i$ are either 0 (no tap) or 1 (tap). This formula represents a uni-directional ("up") LFSR. If the primitive polynomial for a maximum-length n-bit LFSR is g(x) (as defined by the preceding formula), then the primitive polynomial h(x) of an LFSR that generates the reverse sequence is [5]:

$$h(x) = \sum_{i=0}^{n} C_i X^{n-i} \qquad (C_n = C_0 = 1)$$

Figure 4: A 3-bit maximum-length up/down LFSR (C: CLK, R: RESET; a multiplexer per bit selects up or down counting).
Normal LFSR: D0 = Q2, D1 = Q0, D2 = Q1 XNOR Q2; $g(x) = 1 + X^2 + X^3$
Inverse LFSR: D0 = Q1, D1 = Q2 XNOR Q0, D2 = Q0; $h(x) = 1 + X + X^3$
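The update rules listed with Figure 4 can be checked with a short simulation. The sketch below encodes the normal ("up") and inverse ("down") next-state equations of the 3-bit example as Python functions; the function names and the tuple encoding (Q0, Q1, Q2) are choices made for this illustration.

```python
def xnor(a, b):
    """XNOR of two bits given as 0/1 integers."""
    return 1 - (a ^ b)


def step_up(state):
    """Normal LFSR of Figure 4: D0 = Q2, D1 = Q0, D2 = Q1 XNOR Q2."""
    q0, q1, q2 = state
    return (q2, q0, xnor(q1, q2))


def step_down(state):
    """Inverse LFSR of Figure 4: D0 = Q1, D1 = Q2 XNOR Q0, D2 = Q0."""
    q0, q1, q2 = state
    return (q1, xnor(q2, q0), q0)


def up_sequence(start=(0, 0, 0)):
    """States visited by repeated up-steps until the start state recurs."""
    seq = [start]
    state = step_up(start)
    while state != start:
        seq.append(state)
        state = step_up(state)
    return seq


ZERO = (0, 0, 0)           # the reset state doubles as "count is zero"
sequence = up_sequence(ZERO)

# Count up three times, then down three times: back at the zero state.
state = ZERO
for _ in range(3):
    state = step_up(state)
nonzero_after_ups = state != ZERO
for _ in range(3):
    state = step_down(state)
back_at_zero = state == ZERO
```

Starting from the all-zero reset state, the up sequence visits $2^3 - 1 = 7$ distinct states. The all-ones state is never reached, since with XNOR feedback it maps to itself. Because `step_down` undoes `step_up` for every state, a counter that is incremented and decremented equally often returns to the zero state, which is all a CBF needs to detect.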
The superposition of the two LFSRs (the original and its reverse) forms a reversible ("up/down") LFSR. This reversible LFSR can be implemented using the same shift register, a 2-to-1 multiplexer per bit to control the direction of the shift, and several XNOR gates. Figure 4 shows the construction of a 3-bit up/down LFSR. In general, it is possible to construct a maximum-length up/down LFSR of any width with either two or six XNOR gates (i.e., four or eight taps).

2.2.2 L-CBF Implementation
Figure 5 shows the high-level organization of the L-CBF. The L-CBF includes a hierarchical decoder and several partitions, each containing an array of up/down LFSRs. In each partition there is a local zero detector per LFSR counter. A hierarchical multiplexer collects these local is-zero signals and provides the single is-zero output. The L-CBF accepts three inputs and produces a single output, is-zero. The 2-bit input operation encodes any of the three possible operations or none. The address lines are used to specify the address in question and the reset signal is used to initialize all LFSRs to the zero state. An external clock source is needed. The LFSRs use two non-overlapping phase clocks that are generated internally from the external clock signal.

Figure 5: L-CBF architecture.

We use a hierarchical decoder for decoding the address to minimize the energy and delay [2]. The decoder consists of a pre-decoding stage, a global decoder that selects the appropriate partition, and a set of local decoders, one per partition. Each partition contains an array of up/down LFSRs. Each row in each partition contains an up/down LFSR and its zero detector. Finally, we use a hierarchical multiplexer for selecting the appropriate zero-detector output for the is-zero operation. Figure 6 shows the basic cells we used to implement each LFSR and the zero detector. Shown are the flip-flop for the shift register cells, the multiplexers that control the direction of change ("up"/"down"), the XNOR gate, and a bit-slice of the zero detector. Due to space limitations we do not provide additional details on the L-CBF implementation.

3. EXPERIMENTAL RESULTS
In this section, we compare the energy consumption, delay, and area of the S-CBF and L-CBF implementations. We first compare the designs on a per-operation basis and then report energy savings with L-CBF over S-CBF for L1 hit/miss detection using architectural simulation of several SPEC 2000 benchmarks. We implemented all designs using Cadence(R) tools in a commercial 0.13 µm fabrication technology. We did not use an automated process to generate the designs. Instead, we used full-custom design and attempted to optimize the energy and latency of both designs as much as possible. We used the Spectre simulator for circuit simulations. This is the vendor-recommended simulator for design validation prior to manufacturing.

The rest of this section is organized as follows. We initially consider a 1K-entry CBF with 15-bit entries as it is representative of the CBFs used in previous proposals [9,12]. In Section 3.1 we compare the energy, delay and area of the two designs for each of the three operations (increment, decrement and probe).
In Section 3.2 we study how energy and delay change as we vary the number of entries, the width of the counters and the number of taps. In Section 3.3 we demonstrate that L-CBF can reduce energy to 60% compared to S-CBF when used for early L1 cache hit/miss determination.

3.1 Delay and Energy per Operation
We compare implementations of a 1K-entry, 15-bit count per entry CBF. For S-CBF, we use an SRAM with a total capacity of 15 Kbits. We partitioned the SRAM in order to minimize its power/delay product. For the S-CBF we do not consider the delay and energy overhead of the shared counter, since our goal is to demonstrate that the L-CBF consumes less energy and is also faster. To further reduce energy for probes in the S-CBF design, we introduce an extra bit per entry which is updated only when the count changes from or to zero, as described in Section 2.1 (Z-bits). On a probe, we only read this bit.

Furthermore, we applied a number of latency and power optimizations to the S-CBF [2,3,7,4]. The Divided Word Line (DWL) technique, which adopts a two-stage hierarchical row decoder structure, was used to improve speed and power [3,7]. Power was further reduced via pulse operation techniques for the word-lines, the periphery circuits and the sense amplifiers [7]. We also used multi-stage static CMOS decoding [4] and current-mode read and write operations for further power reduction [7]. For the L-CBF implementation we use 16-bit LFSRs so that the LFSR can count at least $2^{15}$ values.

Table 1 shows the delay in picoseconds, the energy (static and dynamic) per operation in picojoules and the area in square micrometers for both the L-CBF and the S-CBF. The last column reports the ratio of S-CBF over L-CBF per metric. We report two rows per category, one for the update and one for the probe operation. For delay and energy we report the worst case, which we measured by selecting appropriate input vectors. Given that we do not consider the overhead (latency and energy) of the shared counter, the measurements for the S-CBF are optimistic.

Table 1. Energy, delay and area of the S-CBF and L-CBF implementations of a 1K-entry, 15-bit CBF.

Metric        Operation   L-CBF     S-CBF     S-CBF/L-CBF
Delay (ps)    inc/dec     447.26    67        3.7
Delay (ps)    probe       58.32     9.2       1.6
Energy (pJ)   inc/dec     38.73     88.98     2.3
Energy (pJ)   probe       3.36      4.2       1.4
Area (um^2)               945825    29557.3

Figure 6: The cells used to implement each up/down LFSR: (a) the two-phase flip-flop, (b) the 2-to-1 mux, (c) the XNOR gate, (d) a bit-slice of the embedded zero detector.
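Per-operation figures such as those in Table 1 turn into application-level energy only once they are weighted by how often each operation occurs. A minimal sketch of that bookkeeping follows; the function names are assumptions, and the operation counts and per-operation energies in the example are hypothetical placeholders, not measurements from this work.

```python
def total_energy(n_updates, n_probes, e_update, e_probe):
    """Energy of a run: operation counts weighted by per-operation energy."""
    return n_updates * e_update + n_probes * e_probe


def energy_ratio(n_updates, n_probes, design_a, design_b):
    """Ratio of design A's energy to design B's for the same operation mix.

    Each design is an (e_update, e_probe) pair in the same energy unit.
    """
    return (total_energy(n_updates, n_probes, *design_a)
            / total_energy(n_updates, n_probes, *design_b))


# Hypothetical placeholder values, chosen only to exercise the arithmetic.
design_a = (2.0, 1.0)   # (update, probe) energy per operation
design_b = (4.0, 1.5)

probe_heavy = energy_ratio(n_updates=100, n_probes=1000,
                           design_a=design_a, design_b=design_b)
update_only = energy_ratio(n_updates=100, n_probes=0,
                           design_a=design_a, design_b=design_b)
```

The ratio moves between the per-operation probe ratio and the per-operation update ratio depending on the mix, so the application-level saving of one design over another depends on the workload's balance of probes and updates.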

The L-CBF is 3.7 and 1.6 times faster than the S-CBF during updates and probes respectively. In addition, the L-CBF consumes 2.3 and 1.4 times less energy compared to the S-CBF for updates and probes respectively. These significant gains in speed and energy consumption come at the expense of increased area. The L-CBF is about 3.2 times larger than the S-CBF. As mentioned in Section 2.2, this is less of a concern in modern processor designs.

Figure 7 shows a per-component breakdown of energy consumption for the two designs and for the two operation categories. For the S-CBF, we can observe that most of the energy (79% and 74% respectively for updates and probes) is consumed by the memory core (wordlines, bitlines and SRAM cells). The decoder and the sense amplifiers consume considerably less energy. This is expected as we applied aggressive energy and latency optimizations to these components. Finally, a small percentage of the overall energy is consumed by peripheral circuitry such as the precharge and write logic. For the update operations of the L-CBF, 2% of the total power is dissipated due to leakage in the zero detectors, which are inactive during updates. The same reasoning applies to the 33% of the total power that is dissipated in the inactive parts during a probe.

Figure 7: Per-component energy consumption for the S-CBF (memory core, decoder, sense amplifiers, others) and the L-CBF (LFSRs, decoder and row drivers, mux and zero detectors) designs. Two sets of results are shown per design, one for the update operations (Inc/Dec) and one for the probe operation.

3.2 Sensitivity Analysis
Thus far we have focused on a specific CBF. In this section we vary the number of entries and the width of the counts. Figure 8 reports the energy per operation for CBFs of 64 through 1K entries in power-of-two steps. We observe that the L-CBF always consumes less energy than the S-CBF and that the relative difference increases slightly for larger entry counts.

Figure 8: Energy per operation (pJ) as a function of entry count (number of rows) for L-CBF and S-CBF for 15-bit counts.

Figure 9 reports the energy per operation as a function of count width in the range of 4 to 16 bits. In this experiment we limit our attention to a 64-entry CBF. Along with the L-CBF measurements we also report the number of taps needed by each count width (either four or eight). We observe that L-CBF's energy scales better than S-CBF's. L-CBF energy increases slightly for wider counts. Communication in the L-CBF is primarily between adjacent cells. For this reason, increasing the number of cells does not impact overall energy significantly. S-CBF's energy increases at a greater rate because additional bitlines and sense amplifiers are introduced and, to a lesser extent, because the wordlines become longer. As can be seen in Figure 9, changing the number of taps in the L-CBF does not significantly impact energy.

Figure 9: Energy per operation (pJ) as a function of count width (counter size in bits) for a 64-entry CBF.

3.3 Energy Savings for Early Hit/Miss Detection
Finally, we demonstrate that L-CBF can reduce energy significantly compared to S-CBF for a practical application. We consider early L1 data cache hit/miss detection as proposed by Peir et al. [12]. Early hit/miss detection can significantly reduce the number of instruction scheduling replays due to L1 data cache misses and hence improve performance. We used Simplescalar v3.0 [6] to simulate the processor detailed in Table 2. We compiled the SPEC CPU 2000 benchmarks for the Alpha 21264 architecture using HP's compilers and for Digital Unix V4.0F, using the SPEC-suggested default flags for peak optimization. We used a reference input data set for all the benchmarks. To obtain reasonable simulation times, samples were taken for five billion committed instructions per benchmark.
We skipped [...] billion committed instructions prior to collecting measurements for all benchmarks except for art and parser, for which we only skipped 2 billion instructions.

We simulate a 512-entry CBF with [...]-bit counts. The CBF is indexed using nine contiguous address bits starting immediately after the last bit that is used as an offset within a cache block. Figure 10 shows the ratio of the energy consumed by the L-CBF over the energy consumed by the S-CBF for this application. A breakdown in terms of updates and probes is also shown. Overall, L-CBF reduces energy by about 40%. Had a larger CBF been used, the energy savings would be higher. Moreover, in this experiment we do not consider any energy savings that would be possible by voltage scaling in the L-CBF. Because the L-CBF is faster than the S-CBF, it may be possible to reduce power further by scaling its voltage supply.

Table 2. Base processor configuration.

Branch Predictor                 8K-entry GShare and 8K-entry bi-modal, 6K selector, 2 branches per cycle
Fetch Unit                       Up to 8 instr. per cycle, 64-entry Fetch Buffer, Non-blocking I-Cache
Issue/Decode/Commit              Any 8 instr. per cycle
Scheduler                        28-entry / 64-entry LSQ
FU Latencies                     Default simplescalar values
Main Memory                      Infinite, 2 cycles
L1D/L1I Geometry                 64KBytes, 4-way set-associative with 64-byte blocks
UL2 Geometry                     2Mbytes, 8-way set-associative with 64-byte blocks
L1D/L1I/L2 Latencies             3/3/6 cycles
Cache Replacement                LRU
Fetch/Decode/Commit Latencies    4 cycles + cache latency for fetch

Figure 10: Energy ratio of L-CBF over S-CBF (probe and inc/dec components, per SPEC CPU 2000 benchmark) for a 512-entry, [...]-bit count CBF for early L1 data cache hit/miss detection.

4. SUMMARY
We presented two designs of CBFs, one based on an SRAM array of counts and one based on an array of linear feedback shift register counters. We evaluated the energy, latency and area of the two implementations of CBFs using a commercial semiconductor technology. Finally, we studied energy consumption for a practical application of CBFs using architectural simulation. The LFSR-based CBF design is superior to the SRAM-based CBF design in both latency and energy at the expense of more area.

ACKNOWLEDGMENTS
We would like to thank Farid Najm and the anonymous reviewers for their comments. We are grateful to Mohammad Hajirostam for his assistance in circuit design and simulations and to Navid Azizi for his help with SRAM design. This work was supported by an NSERC Discovery Grant, a Canada Foundation for Innovation Equipment Grant, an Intel Research Council Grant and by Semiconductor Research Corporation under contract #9..

REFERENCES
[1] P. Alfke, Efficient Shift Registers, LFSR Counters, and Long Pseudo-Random Sequence Generators, Xilinx, Application Note 052, Jul. 1996.
[2] B. S. Amrutur and M. A. Horowitz, Fast Low-Power Decoders for RAMs, IEEE Journal of Solid-State Circuits, 36(10):1506-1515, Oct. 2001.
[3] B. S. Amrutur, Design and Analysis of Fast Low Power SRAMs, Ph.D. dissertation, Electrical Engineering Department, Stanford University, 1999.
[4] B. S. Amrutur and M. A. Horowitz, Speed and Power Scaling of SRAM's, IEEE Journal of Solid-State Circuits, 35(2):175-185, Feb. 2000.
[5] P. H. Bardell, W. H. McAnney, and J. Savir, Built-In Test for VLSI: Pseudorandom Techniques, John Wiley & Sons, Inc., 1987.
[6] D. Burger and T. Austin, The Simplescalar Tool Set v2.0, Technical Report UW-CS-97-1342, Computer Sciences Department, University of Wisconsin-Madison, Jun. 1997.
[7] M. Margala, Low-power SRAM Circuit Design, In Proc. of IEEE Workshop on Memory Technology, Design and Testing, 115-122, Aug. 1999.
[8] A. Moshovos, RegionScout: Exploiting Coarse-Grain Sharing in Snoop-Coherence, In Proc. Annual International Symposium on Computer Architecture, Jun. 2005.
[9] A. Moshovos, G. Memik, B. Falsafi, and A. Choudhary, Jetty: Filtering Snoops for Reduced Energy Consumption in SMP Servers, In Proc. of the Annual International Conference on High-Performance Computer Architecture, 85-96, Feb. 2001.
[10] S. Sethumadhavan, R. Desikan, D. Burger, C. R. Moore, and S. W. Keckler, Scalable Hardware Memory Disambiguation for High-ILP Processors, IEEE Micro, 24(6):118-127, Nov. 2004.
[11] M. R. Stan, Synchronous Up/Down Counter with Clock Period Independent of Counter Size, In Proc. IEEE Symposium on Computer Arithmetic, 274-281, Jul. 1997.
[12] J. K. Peir, S. C. Lai, S. L. Lu, J. Stark, and K. Lai, Bloom Filtering Cache Misses for Accurate Data Speculation and Prefetching, In Proc. Annual International Conference on Supercomputing, Jun. 2002.