Achieving 550 MHz in an ASIC Methodology

D. G. Chinnery, B. Nikolić, K. Keutzer
Department of Electrical Engineering and Computer Sciences
University of California at Berkeley
{chinnery, bora, keutzer}@eecs.berkeley.edu

ABSTRACT
Typically, good automated ASIC designs may be two to five times slower than handcrafted custom designs. At last year's DAC this was examined and causes of the speed gap between custom circuits and ASICs were identified. In particular, faster custom speeds are achieved by a combination of factors: good architecture with well-balanced pipelines; compact logic design; timing overhead minimization; careful floorplanning, partitioning and placement; dynamic logic; post-layout transistor and wire sizing; and speed binning of chips. Closing the speed gap requires improving these same factors in ASICs, as far as possible. In this paper we examine a practical example of how these factors may be improved in ASICs. In particular we show how techniques commonly found in custom design were applied to design a high-speed 550 MHz disk drive read channel in an ASIC design flow.

General Terms
Performance, design.

Keywords
ASIC, clock, frequency, speed, throughput, comparison, custom.

1. INTRODUCTION
Typically, speeds of good application-specific integrated circuits (ASICs) lag those of the fastest custom circuits in the same processing geometry by factors of two or more. In decreasing order of importance, we have shown [3] that custom designs are faster due to use of dynamic logic on critical paths; speed-binning of chips; timing overhead minimization with clock tree design for skew minimization, and good latch and flip-flop design; and custom transistor and wire sizing. Table 1 gives our overview of the maximum contributions of various factors to the speed differential between ASICs and custom ICs. These are similar to those presented in [3], but clocking related issues have been gathered in a single heading.

A custom processor designer has a full range of design style choices. These include architecture and micro-architecture, logic design, floorplanning and physical placement, and choice of logic family.
Also, circuits can be optimized by hand and transistors individually sized for speed, lower power, and lower area.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 2001, June 18-22, 2001, Las Vegas, Nevada, USA. Copyright 2001 ACM.

Table 1. Maximum differences between custom and ASIC. A factor of 1.0 indicates no difference.

Factors contributing to custom being better than ASICs | vs. poor ASIC | vs. best practice ASIC
micro-architecture: pipelining; logic design | . | .
process variation and accessibility | . | .
dynamic logic on critical paths | . | .
timing overhead: clock tree distribution; latch/flip-flop design | . | .
floorplanning and placement | . | .
sizing of transistors and wires | . | .

ASIC tools cannot handle dynamic logic and the process variation gap cannot be closed fully, but ASIC designs can be improved in a variety of ways []. To show how the gap between ASIC and custom can be closed, we examine an ASIC disk drive read channel chip that achieves a frequency of 550 MHz in . µm CMOS []. The speed of 550 MHz is comparable to custom design speeds. This example illustrates the importance of some of the principles outlined in last year's presentation, such as partitioning and duplication of logic to achieve higher throughput. Additionally, we show that the clocking overhead in ASIC designs can be reduced with careful clock tree design to ensure predictable skew, and latch-based design.

The key opportunities for closing the gap between ASIC and custom form the organizing principle of this paper. Specifically, in Section 2 we overview the design example, and examine its micro-architecture in Section 3. In Section 4 we discuss timing and latch design, and Section 5 discusses clock tree distribution. Section 6 compares ASIC and custom logic designs.
Section 7 examines cell and wire sizing, while Section 8 looks at controlling process variability. Finally, Section 9 reflects our conclusions.

2. A DESIGN BRIDGING THE SPEED GAP BETWEEN ASIC AND CUSTOM
Disk drive read channels are a high-speed signal processing application. Data rates in current high-performance commercial products are in the range of - Mb/s [,, ] and demand increasingly high speeds. For example, Marvell's 88C has a throughput of . Gbit/s with .8 µm technology and . V supply [11]. We examine a competitive ASIC disk drive read channel design, the Texas Instruments SP in . µm CMOS (.8 µm L_eff) with .9 V supply [22]. It is based on EPR equalization [21], operates at 550 MHz, dissipates at most . W at full speed, and has Mb/s user data rate. This speed is comparable to custom-designed read channels in similar technology [].

[Figure 1: Disk drive read channel block diagram. Blocks: scrambler, encoder, precomp, VGA, CT filter, ADC, equalizer, detector, decoder, descrambler, timing recovery, servo. Signals: write data, write signal, read signal, read data.]

[Figure 2: Direct form FIR.]

[Figure 3: Transpose-form FIR. Delay element, e.g. flip-flop.]

Good synthesis requires a rich library with sufficiently many drive strengths []. This ASIC used TI's standard cell library, with to gate sizes for standard cells, and buffer/inverter sizes. To reduce timing overhead, some custom designed cells were characterized for an ASIC flow and used in the SP (e.g. the memory elements on the critical paths, such as the SAFF), and their drive strength was matched to their typical load. The SP was an entirely new design taken from application concept, including new algorithms and architecture, to circuit realization, in a new process, in nine months. We concentrate on techniques to speed up bottlenecks (the Viterbi detector and adaptive equalizer in the read path) in the chip's digital portion.

2.1 Read and Write Data
The block diagram in Figure 1 represents most of today's read channels. The write data is scrambled, runlength encoded, precompensated for magnetic channel nonlinearities, and fed to the write head. The read signal is read by the read head, preamplified, and processed by the read channel. On the read side, the signal from the preamplifier is conditioned by the variable-gain amplifier (VGA) and continuous-time (CT) filter, before analog-to-digital conversion (ADC). Besides antialiasing, the continuous time filter partially equalizes the data. After the ADC, the data is processed digitally. The key blocks are the digital adaptive equalizer, the Viterbi detector, the runlength decoder and the descrambler. Then the data is converted from serial bit data to byte data, and can then be processed at a lower speed. The timing recovery and servo blocks use equalized and detected data.
2.2 Digital Portion Speed Bottlenecks
Due to increasing storage densities, to limit noise enhancement read channels use partial-response equalization, which is often done by finite impulse response (FIR) filtering. Viterbi detection resolves the remaining inter-symbol interference. The FIR filter performs partial-response equalization, with least-mean squares (LMS) algorithm adapted taps. The FIR filter critical path has a slow multiply-accumulate operation, which is not recursive, so pipelining and parallelization can achieve the desired throughput, at the expense of increased area. In contrast, the single-cycle dependency of the Viterbi algorithm prevents pipelining, and reducing the timing overhead is the only way to increase speed.

[Figure 4: Two-path parallel transpose FIR. Inputs x(n) odd and x(n) even; outputs y(n) odd and y(n) even.]

[Figure 5: Viterbi algorithm two-state trellis, showing state metrics (sm) and branch metrics (bm) over time.]

3. MICRO-ARCHITECTURE: PIPELINING AND LOGIC DESIGN
Frequently, micro-architectural transformations reduce the critical path in signal processing datapaths. Different transformations are applicable to structures without and with cycle dependency (e.g. FIR filter and Viterbi detector respectively) []. Micro-architectural exploration is much easier using an HDL description, making ASIC design iteration an order of magnitude faster. Custom layouts have to be redone by hand to explore alternative structures [], whereas high-level HDL can be quickly rewritten and the ASIC tools produce the corresponding layout for evaluation. As Table 1 shows, micro-architecture offers the greatest potential for speed improvement, and we will dedicate most of this paper to detailing the micro-architectural improvements that netted the principal speed gains.

3.1 FIR filter
Several transformations can speed up the multiply-accumulate operation in the FIR filter critical path. LMS update of coefficients adds feedback recursion to the FIR implementation, but employing delayed or semi-static LMS allows the critical path to be pipelined (LMS coefficients are updated before the read cycle with the training sequence). The SP uses several techniques to shorten the critical path []. The FIR equation is:

$$ y[n] = \sum_{k} h[k]\, x[n-k] \qquad (1) $$

Direct implementation of the FIR equation gives the direct-form FIR [9], shown in Figure 2. Some possible transformations for reducing the critical path are shown in Figures 3 and 4. Inherent pipelining can be achieved by transposing the data flow graph, giving a transpose-form FIR, shown in Figure 3. To further reduce the delay, the FIR can be interleaved and computed in parallel []. For m parallel paths, the area increases linearly with m, and the multiply-add is performed at 1/m of the data rate. Figure 4 shows a two-path parallel transpose type FIR, performing multiply-add at half the data rate. Booth recoding can speed up multiplication, further reducing the multiply-accumulate delay []. With these architectural improvements, the SP FIR achieves a MHz frequency, and Mb/s throughput (encoded data is read at Mb/s, and the throughput is Mb/s after redundancy removal).
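The FIR transformations described above can be illustrated with a small behavioural model. This is a minimal sketch in Python, not the chip's implementation: the function names (`fir_direct`, `fir_transpose`, `fir_two_path`) and the coefficient values are hypothetical, and samples before the start of the input are assumed to be zero. It only shows that the direct form, the transpose form, and a two-path interleaved evaluation compute the same outputs.

```python
def fir_direct(h, x):
    """Direct form: y[n] = sum over k of h[k] * x[n-k], with x[m] = 0 for m < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]


def fir_transpose(h, x):
    """Transpose form: every tap multiplies the current sample, and the
    products are accumulated through a chain of registers."""
    state = [0] * (len(h) - 1)          # registers between the adders
    out = []
    for sample in x:
        products = [c * sample for c in h]
        out.append(products[0] + (state[0] if state else 0))
        # Register i takes product i+1 plus the old value of register i+1.
        state = [products[i + 1] + (state[i + 1] if i + 1 < len(state) else 0)
                 for i in range(len(state))]
    return out


def fir_two_path(h, x):
    """Two interleaved paths: one produces even-indexed outputs, the other
    odd-indexed outputs, so each path runs once per pair of samples."""
    def tap_sum(n):
        return sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)

    out = [0] * len(x)
    for m in range(0, len(x), 2):       # one iteration models one half-rate cycle
        out[m] = tap_sum(m)             # even path
        if m + 1 < len(x):
            out[m + 1] = tap_sum(m + 1) # odd path
    return out


if __name__ == "__main__":
    h = [1, -2, 3, 1]                   # hypothetical tap coefficients
    x = [3, 1, 4, 1, 5, 9, 2, 6]        # hypothetical input samples
    print(fir_direct(h, x) == fir_transpose(h, x) == fir_two_path(h, x))
```

In the two-path model each path evaluates one output per pair of input samples, which mirrors the statement that the multiply-add runs at half the data rate at the cost of duplicated hardware.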

[Figure 6: Illustration of the add-compare-select recursive cycle dependence in the Viterbi algorithm (add, compare, select). Each new state metric is the minimum over the incoming branches, sm_k(n) = min(sm_i(n-1) + bm_i,k, sm_j(n-1) + bm_j,k).]

[Figure 7: One-step look-ahead applied to 8-state Viterbi decoder [].]

3.2 Viterbi decoder
The Viterbi algorithm has tight, single-cycle recursion. The Viterbi algorithm is commonly expressed in terms of a trellis diagram, which is a time-indexed version of a state diagram. The simplest two-state trellis is shown in Figure 5. Maximizing probabilities of paths through a trellis of state transitions (branches) determines the most likely sequence, for an input digital stream with inter-symbol interference. The branch metric (bm) is the cost of traversing along a specific branch, as indicated in Figure 5. State metrics (sm) accumulate the minimum cost of arriving into a specific state. The path finally taken through the trellis is a survivor sequence, the most likely sequence of recorded data. Figure 6 shows a two-state example.

The Viterbi detector is a processor that implements the Viterbi algorithm, and consists of three major blocks: the central part is the add-compare-select unit (ACS); the branch metric calculation unit; and the survivor path decoding unit. Efficient design of the ACS, which is a nonlinear feedback loop, is crucial to achieve a high throughput to circuit area ratio. The ACS calculates the sums of the state metrics (sm) with corresponding branch metrics (bm) and selects the maximal (or minimal) to be the new state metrics. The throughput depends highly on the ACS addition and comparison implementations. The comparison is frequently done via subtraction, and the carry profile inside the adders and subtractors determines the speed.

Architectural transformations, like loop unrolling and retiming, can be applied to an ACS recursion to reduce the critical path. Applying a one-step look-ahead to the ACS theoretically roughly doubles the throughput.
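The add-compare-select recursion and the one-step look-ahead can be sketched as follows. This is an illustrative Python model under stated assumptions, not the detector's implementation: it uses a fully connected two-state trellis with minimum-cost metrics, and the names `acs_step` and `acs_lookahead` and all metric values are hypothetical. The look-ahead version collapses two trellis steps into one four-way selection per state, which is why it can in principle process two symbols per recursion.

```python
from itertools import product


def acs_step(sm, bm):
    """One add-compare-select step.

    sm[j] is the metric of state j; bm[j][k] is the cost of branch j -> k.
    Each new state metric is the cheapest way of arriving in that state.
    """
    states = range(len(sm))
    return [min(sm[j] + bm[j][k] for j in states) for k in states]


def acs_lookahead(sm, bm_a, bm_b):
    """Two trellis steps collapsed into one recursion.

    Each new state metric is the minimum over every two-branch path
    i -> j -> k, i.e. a four-way selection in a two-state trellis.
    """
    states = range(len(sm))
    return [min(sm[i] + bm_a[i][j] + bm_b[j][k]
                for i, j in product(states, repeat=2))
            for k in states]


if __name__ == "__main__":
    sm = [0, 3]                      # hypothetical starting state metrics
    bm_a = [[1, 4], [2, 0]]          # hypothetical branch metrics, first step
    bm_b = [[3, 1], [0, 2]]          # hypothetical branch metrics, second step
    two_steps = acs_step(acs_step(sm, bm_a), bm_b)
    print(two_steps == acs_lookahead(sm, bm_a, bm_b))
```

The equality holds because taking the minimum after each step and taking it once over all two-step paths select the same cheapest path; the hardware cost of the second form is the wider comparison.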
However, in the deep submicron, this speed gain is reduced to only % by increased wiring overhead, while the area increases by a factor of . Figure 7 shows the transformed algorithm, using a four-way ACS operation. Often, retiming is used to increase throughput, and it can be used for recursive algorithms. Transforming and retiming the ACS to perform compare-select-add (CSA) [, ] removes the comparison from the critical path, as shown in Figure 8 []. Further speed improvements are possible using a redundant number system and carry-save addition []. This allows deeper bit-level pipelining of the ACS, increasing the speed. A practical realization with a dynamic pipeline and latches was shown in []. The SP Viterbi decoder runs at a frequency of MHz, with a user throughput of Mb/s (recording code rate reduces the user data rate).

[Figure 8: Transforming ACS to CSA. (a) Standard ACS; (b) Transformation to CSA; (c) Retimed CSA.]

4. TIMING AND LATCH DESIGN
With deep pipelining achieved by architectural transformations, the timing overhead fraction of cycle time increases. Reducing the impact of skew and better timing element design can significantly improve ASIC speeds. Typical high-performance custom designs keep the timing overhead down to % to % [, 8]. In a MHz design, this is as little as . ns, whereas in common ASIC methodology the timing overhead is about . ns. ASIC designs typically use flip-flops, which present hard boundaries between the pipeline stages, which must be well balanced as there is no slack passing. Also, the timing budget has to include skew. Latches allow slack passing and are skew insensitive. Latches are well supported by the synthesis tools [20], but are rarely used other than in custom designs. We believe that with latches and good clocking methodology, the speed impact of timing overhead on ASIC designs can be reduced from about % worse than custom to about % worse.
4.1 Latch Slack Passing and Time Borrowing
Time (slack) not used by the combinational logic delay in one pipeline stage is automatically passed to the next stage in latch-based designs. Likewise, a logic stage can borrow time from the succeeding stage to complete the required function []. Figure 9 illustrates slack passing and time borrowing. Let t_D-Q be the time from data being ready at the latch input D to the latch output Q becoming valid, when the data transitions while the latch is transparent. Let t_Clk-Q be the time from the clock arriving to the latch output Q becoming valid, when the data is ready prior to the latch becoming transparent. The delay of combinational block i is t_comb,i.

As data at D is available before the latch becomes transparent, output Q becomes valid after t_Clk-Q, and the combinational logic block can start evaluating; but D could have arrived as late as the setup time, so it is passing slack on to the evaluation of the combinational logic block. The latest the combinational logic block can finish is by the setup time of the transparent low latch, and it does so, as it has a long critical path; but D could have arrived as early as t_Clk-Q after the clock to latch L went low, so time has been borrowed from when the combinational logic block could have started evaluating.
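Time borrowing can be made concrete with a simplified timing check. The sketch below is a hedged Python model, not a description of any tool's algorithm: it assumes an idealized two-phase scheme in which latch i is transparent during the i-th half period, ignores skew and hold time, and uses hypothetical function names and delay values in picoseconds. It contrasts latch-based stages, where a late arrival simply flows through the open latch, with hard stage boundaries, where every stage must fit on its own.

```python
def latch_pipeline_meets_timing(t_comb, half_period, t_dq, t_clkq, t_setup):
    """Check a chain of alternating-phase latches with time borrowing.

    Latch i is transparent during [i * half_period, (i + 1) * half_period).
    Data arriving before the latch opens leaves t_clkq after the opening
    edge; data arriving while it is open leaves t_dq after arrival.
    """
    arrival = 0  # data is ready at the first latch as it opens
    for i, delay in enumerate(t_comb):
        opens = i * half_period
        closes = opens + half_period
        if arrival > closes - t_setup:
            return False              # missed the latch's setup time
        valid = max(arrival + t_dq, opens + t_clkq)
        arrival = valid + delay
    last_closes = (len(t_comb) + 1) * half_period
    return arrival <= last_closes - t_setup


def hard_boundary_meets_timing(t_comb, half_period, t_clkq, t_setup):
    """Same stages, but every stage must fit within its own half period."""
    return all(t_clkq + delay + t_setup <= half_period for delay in t_comb)


if __name__ == "__main__":
    stages = [1300, 500]              # hypothetical stage delays in ps
    print(latch_pipeline_meets_timing(stages, 1000, 100, 100, 50))
    print(hard_boundary_meets_timing(stages, 1000, 100, 50))
```

With these illustrative numbers the long first stage borrows time from the short second stage, so the latch-based check passes while the hard-boundary check fails. Borrowing cannot accumulate indefinitely: three consecutive 1300 ps stages fail in both models.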

[Figure 9: Maximum slack passing and time borrowing between the stages []. L1 and L3 are transparent high latches, L2 is a transparent low latch. The diagram marks the latch setup times, t_Clk-Q, t_D-Q, t_comb, the slack passed to combinational logic, and the time borrowed by combinational logic.]

[Figure 10: Skew-tolerant level sensitive clocking []. L1 is a transparent high latch, L2 is a transparent low latch. The diagram marks the clock skew, the latch setup times, the cycle time, t_comb and t_D-Q.]

[Figure 11: Hybrid latch-flip-flop (HLFF).]

4.2 Skew Tolerance
Figure 10 shows skew tolerance. The skew does not affect the minimum cycle time if the longest combinational logic path always arrives after the latch becomes transparent, and before the setup time plus the skew. The minimum cycle time is:

$$ t_{comb1} + t_{D\text{-}Q1} + t_{comb2} + t_{D\text{-}Q2} \qquad (2) $$

4.3 Latch-Based FIR Filter Design
The SP FIR filter uses latches and time borrowing []. The parallel transpose-type FIR architecture has two critical timing paths: from the ADC output through the multiply-accumulate; and from the previous latch output through addition to the next latch stage. In this implementation, the third path, for the coefficient update, is not an issue since the coefficients are semi-static. As the architecture is split into two paths, the timing critical multiply-accumulate operation is implemented at half the data rate. This implementation of the FIR filter is based on one-hot Booth encoding of -bit data. Encoded data is distributed to all the taps of the filter. Each tap coefficient is pre-multiplied for C, -C, -C, -C,, C, C, C, 8C, and a correct partial product is selected (using a custom 9:1 multiplexer) by the encoded data. This significantly simplifies the multiplication. The use of latches naturally allows time borrowing between the FIR taps.
Also, since the equalizer is the first block that follows the ADC, its clock tree insertion delay can be increased to absorb the additional delay required for data encoding.

[Figure 12: Modified sense-amplifier-based flip-flop (SAFF).]

4.4 Viterbi Decoder
As the Viterbi decoder has tight recursion, slack passing cannot be used. Instead, faster flip-flops can reduce the clocking overhead. In an edge-triggered system, cycle time T has to meet the following relationship:

$$ T \ge t_{Clk\text{-}Q} + t_{comb} + t_{su} + t_{skew} \qquad (3) $$

In the critical path, the flip-flop delay is the sum of setup time and the clock-to-output delay, t_su + t_Clk-Q. The skew is t_skew, the combinational delay is t_comb, and the hold time is t_h. Pulsed latches have less total delay (latency) than master-slave latch pairs. Examples are the hybrid latch-flip-flop (HLFF) [16], Figure 11, and the modified sense-amplifier-based flip-flop (SAFF) [14], Figure 12. Their first stage is a pulse generator, and the second stage is a latch to capture the pulse. The HLFF generates a negative pulse when D = 1, which is captured by the D-type latch. The SAFF generates a negative pulse on either S or R, which triggers the SR latch. Similar flip-flop designs are used in custom processors such as the DEC Alpha and the StrongARM.
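The edge-triggered cycle-time relationship can be turned into a short worked calculation. The following Python sketch uses hypothetical delay values in picoseconds (they are not taken from the design) and hypothetical function names; it adds the four terms of the constraint and reports what fraction of the cycle is timing overhead rather than useful logic.

```python
def min_cycle_time(t_clkq, t_comb, t_su, t_skew):
    """Smallest cycle time satisfying T >= t_clkq + t_comb + t_su + t_skew."""
    return t_clkq + t_comb + t_su + t_skew


def timing_overhead_fraction(t_clkq, t_comb, t_su, t_skew):
    """Share of the minimum cycle spent on flip-flop delay and skew."""
    cycle = min_cycle_time(t_clkq, t_comb, t_su, t_skew)
    return (cycle - t_comb) / cycle


if __name__ == "__main__":
    # Hypothetical values in picoseconds, chosen only for illustration.
    cycle = min_cycle_time(t_clkq=150, t_comb=1200, t_su=50, t_skew=100)
    overhead = timing_overhead_fraction(150, 1200, 50, 100)
    print(cycle, overhead, 1e6 / cycle)   # cycle in ps, fraction, frequency in MHz
```

Because the combinational delay in a tight recursion cannot be pipelined away, the only terms left to shrink are the flip-flop delay and the skew, which is the motivation for the faster storage elements discussed here.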

[Figure 13: Prescribed clock tree, from gated clock source to about clocked elements.]

Some pulsed latches exhibit a soft clock edge property, which can be accounted for during cell characterization for skew tolerance [8]. However, characterization must include the long hold time of pulse triggered latches. For example, synthesis inserts buffers in the scan chain when skew is comparable to t_Clk-Q - t_h, significantly increasing the area. The HLFF has a transparency period of about three inverter delays, while the SAFF has very small skew tolerance, controlled by sizing M_N. The SP Viterbi decoder uses a flip-flop derived from the SAFF, characterized as a standard cell, where needed in the ACS. An advantage of the SAFF is its differential output structure, doubling the drive strength if both outputs are used in synthesis.

5. CLOCK TREE INSERTION AND CLOCK DISTRIBUTION
Partitioning an ASIC design into blocks of , gates or less can improve synthesis results and help convergence, by limiting the maximum wire length [19]. The read channel presents a natural opportunity for design partitioning. All of the timing critical signal processing blocks are about , to , gates, with layout areas of to mm² in . µm CMOS. Block partitioning the design requires gated clock trees to be inserted in the blocks. Also, limiting clock distribution over a smaller area minimizes skew. However, the local clock trees have to be merged into a global clock tree with added clock gating, which is not generally well supported in standard ASIC methodologies.

The local clock trees in SP are designed for equal rise and fall times, and minimum skew, by buffer sizing and placement. Fixing the fan-out at each clock tree level controls the insertion delay, and prescribed clock trees control the insertion delay to allow later matching. For example, for given total flip-flop/latch load and block size the total load is computed. By prescribing the size and number of buffers in the last stage the slope is met. Based on the post-layout extraction data, the clock tree can be trimmed (by shorting or leaving open buffer outputs) to match the insertion delays, as illustrated in Figure 13.
This reduces the skew to ps.

6. CUSTOM LOGIC VERSUS SYNTHESIS
ASIC designs can reach high speeds at the price of larger area and higher power. Figure 14 summarizes the basic tradeoff between the area and speed in synthesized designs. ACS and CSA based Viterbi decoders with fixed micro-architecture were synthesized for different clock cycles. For longer clock cycles ( ns for ACS), synthesis easily achieves the speed goal. To achieve shorter cycle times synthesis increases gate sizes, to drive the interconnect and loads more strongly. Interestingly, the ACS is smaller for lower speeds, but the two ASIC curves intersect at a clock period of . ns.

[Figure 14: Area-delay comparison of synthesized and custom ACS and CSA Viterbi detectors. Axes: area (cells) versus clock period (ns). Curves: Custom, CSA, ACS.]

To obtain some data points on the implementation efficiency of synthesized combinational logic, a functionally equivalent Viterbi decoder ACS array was implemented in custom logic. The design was based on complementary and pass transistor logic, and supported the same clocking style. Even though the adders and comparators were implemented using differential logic, the custom ACS was roughly half the size at the same speed as the synthesized version, because the custom logic used much smaller transistors and had less wiring capacitance. When the custom design was doubled in size to the synthesized area, it ran only % faster than the synthesized design. However, the flip-flops were not changed, and the design was not reoptimized (wiring or placement) for the new gates.

7. PROCESS VARIATION
After micro-architecture, process can account for the greatest difference between ASIC and custom designs. Speed-binning is not practical for ASICs, but in our example the disk drive read channel has an on-chip voltage regulator that gives better control over supply voltage and allows tighter worst and best-case voltage corners. This does require re-characterization of the library for non-standard corners, but results in a -% speed increase.

8. SUMMARY AND CONCLUSIONS
We have examined techniques used to achieve high speeds in a 550 MHz chip with an ASIC design methodology. We have identified several design techniques, common to custom designs, that were used to improve the performance of a disk drive read channel designed in an ASIC methodology. Once the FIR filter and Viterbi detector had been identified as the speed bottlenecks, architectural transformations and alternative clocking styles were used to increase their speed. The FIR filter could be pipelined, and computation in parallel doubles the speed at the price of doubling the area. Architectural transformation of the add-compare-select (ACS) in the Viterbi detector to compare-select-add (CSA) reduced the critical path length. Pipelining was not possible because of the recursive nature of the Viterbi algorithm, but reducing the clocking overhead can increase the speed further. Reducing skew and using high speed latches (instead of flip-flops) can reduce the clocking overhead. The timing overhead factor improves from about . worse than custom to . worse.

With this example, we have independently confirmed the thesis of [3], that ASIC designs can be brought to within custom speeds with a proper design methodology orchestration, and attention to key design factors. Nevertheless, compared to custom implementations, ASICs will still be larger at the same speed, or slower for the same area, which was illustrated by comparing custom CSA implementations to CSA and ACS ASIC versions. Quantifying and exploring the area and power gap between ASIC and custom designs is a good direction for future work.

9. ACKNOWLEDGMENTS
We would like to acknowledge the SP design team, especially Kiyoshi Fukahori, Michael Leung, James Chiu, Bogdan Staszewski, Vivian Jia, and David Gruetter. James Chiu provided the area-delay comparison data.

10. REFERENCES
[1] Altekar, S., et al. A Mb/s BiCMOS Read Channel Integrated Circuit, IEEE International Solid-State Circuits Conference, Digest of Technical Papers, San Francisco CA, February , 8-8, .
[2] Black, P., and Meng, T. A Mb/s -state radix- Viterbi decoder, IEEE Journal of Solid-State Circuits, vol. -, December 99, 8-88.
[3] Chinnery, D. G., and Keutzer, K. Closing the Gap Between ASIC and Custom: An ASIC Perspective, Proceedings of the th Design Automation Conference, Los Angeles CA, June , -.
[4] Fettweis, G., et al. Reduced-complexity Viterbi detector architectures for partial response signaling, IEEE Global Telecommunications Conference, Singapore, Technical Program Conference Record, vol. , November 99, 9-.
[5] Fettweis, G., and Meyer, H. High-speed parallel Viterbi decoding algorithm and VLSI architecture, IEEE Communications Magazine, vol. 9-8, May 99, -.
[6] Fey, C. F., and Paraskevopoulos, D. E. Studies in LSI Technology Economics IV: Models for gate design productivity, IEEE Journal of Solid-State Circuits, vol. SC- -, August 1989, 8-9.
[7] Gronowski, P., et al. High-Performance Microprocessor Design, IEEE Journal of Solid-State Circuits, vol. -, May 1998, -8.
[8] Harris, D., and Horowitz, M. Skew-Tolerant Domino Circuits, IEEE Journal of Solid-State Circuits, vol. -, November 99, -.
[9] Jain, R., Yang, P.T., and Yoshino, T. FIRGEN: a computer-aided design system for high performance FIR filter integrated circuits, IEEE Transactions on Signal Processing, vol. 9-, July 99, -8.
[10] Lee, I., and Sonntag, J.L. A new architecture for the fast Viterbi algorithm, IEEE Global Telecommunications Conference, San Francisco CA, Technical Program Conference Record, vol. , November , 8.
[11] Marvell Introduces HighPhy™, the Industry's First Read Channel PHY to Exceed Gigahertz Speeds, December . http://www.marvell.com/news/dec_.htm
[12] Messerschmitt, D. G. Breaking the recursive bottleneck, in Skwirzynski, J.K. (ed.) Performance Limits in Communication Theory and Practice, Kluwer, 1988, -9.
[13] Nazari, N. A Mb/s disk drive read channel in . µm CMOS incorporating programmable noise predictive Viterbi detection and trellis coding, IEEE International Solid-State Circuits Conference, Digest of Technical Papers, San Francisco CA, February , 8-9, 9.
[14] Nikolić, B. et al. Sense amplifier-based flip-flop, IEEE Journal of Solid-State Circuits, vol. , June , 8-88.
[15] Partovi, H. Clocked storage elements, in Chandrakasan, A., Bowhill, W.J., and Fox, F. (eds.). Design of High-Performance Microprocessor Circuits. IEEE Press, Piscataway NJ, , -.
[16] Partovi, H., et al. Flow-through latch and edge-triggered flip-flop hybrid elements, IEEE International Solid-State Circuits Conference, Digest of Technical Papers, San Francisco CA, February 99, 8-9.
[17] Staszewski, R.B., Muhammad, K., and Balsara, P. A -MSample/s 8-Tap FIR digital filter for magnetic recording read channels, IEEE Journal of Solid-State Circuits, vol. -8, Aug. , .
[18] Stojanovic, V., and Oklobdzija, V.G. Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems, IEEE Journal of Solid-State Circuits, vol. -, April 1999, -8.
[19] Sylvester, D., and Keutzer, K. Getting to the bottom of deep submicron, Proceedings of the International Conference on Computer Aided Design, San Jose CA, November 1998, -.
[20] Synopsys Design Compiler, Reference Manual, Synopsys.
[21] Thapar, H. K., and Patel, A.M. A Class of Partial Response Systems for Increasing Storage Density in Magnetic Recording, IEEE Transactions on Magnetics, vol. MAG-- part , September 98, -8.
[22] Texas Instruments SP CMOS Digital Read Channel, 1999. http://www.ti.com/sc/docs/storage/products/sp/index.htm
[23] Weste, Neil H.E., and Eshraghian, Kamran. Principles of CMOS VLSI Design, 2nd Ed. Addison-Wesley, Reading MA, 99, -.
[24] Yeung, A.K., and Rabaey, J.M. A Mb/s radix- bit-level pipelined Viterbi decoder, IEEE International Solid-State Circuits Conference, Digest of Technical Papers, San Francisco CA, February 99, 88-89, .