A Wave-Pipelined On-chip Interconnect Structure for Networks-on-Chips

Jiang Xu and Wayne Wolf
Dept. of ELE, Princeton University
jiangxu@princeton.edu, wolf@princeton.edu

Abstract

The paper describes a structured communication link design technique, wave-pipelined interconnect, for networks-on-chip. We achieved 3.45GHz and 55.2Gbps throughput on a 10mm 16-bit interconnection in a 0.25um technology. It uses 0.079mm² of area, and it needs only 18.8pJ to transmit one bit. We reduce crosstalk delay by 79% using two techniques: interleaved lines and misaligned repeaters. This paper shows in detail the various techniques we used to save power and area and to achieve high performance in a relatively old technology. Wave-pipelined interconnect design is relatively easy, and its many features give a large and flexible design space for high-performance chips.

1. Introduction

This paper introduces an on-chip interconnect structure using wave pipeline for networks-on-chip (NoC). By using this structure, we achieved 3.45GHz on a 10mm bidirectional on-chip interconnect in a 0.25um aluminum technology, which gives a 55.2Gbps throughput on a 16-bit interconnect. It uses very small area and saves power. The design is easy. It supports asynchronous communication and globally asynchronous and locally synchronous scenarios.

The number of usable transistors is increasing exponentially, which helps designers meet the growing market demand. High-performance and multifunction chips, for example systems-on-chip (SoC), will work at very high frequencies (about 10 GHz) and have a huge number of transistors (more than one billion) and a large area (about 572mm²). These chips will use globally asynchronous and locally synchronous clock schemes [2]. Global interconnects are needed to communicate among different clock zones. They have to be high performance in many cases, for example, the interconnect between a processor core and embedded memories. Another design trend is reusing IP (Intellectual Property) cores.
To accelerate large designs, designers try to reuse a large number of IP cores on a single chip. We, among many other researchers, suggest connecting IP cores by NoC. NoC can be reused and can further reduce the time-to-market [1]. Global interconnects are important elements of NoC, and they connect IP cores together. Structured interconnect is the foundation of NoC design. Wave-pipelined interconnect is a technique to support structured interconnect.

While transistors become smaller and faster, the International Technology Roadmap for Semiconductors (ITRS) [3] predicts that global interconnects will become slower and relatively larger, even after repeaters are properly inserted. On one hand, due to smaller feature sizes, the cross-sectional area of interconnects decreases and the resistance per unit length increases, which increases delay. On the other hand, global interconnects become taller and relatively closer, and crosstalk becomes a serious problem. Lower supply voltages make communication through global interconnects more sensitive to crosstalk. Slow global interconnects limit the communications between IP cores, which will run at GHz level.

In this paper, we show a new technique for on-chip interconnects using wave pipeline, which attacks the performance, crosstalk, and power issues in global interconnects. This technique also gives designers more flexibility in NoC design. In the following section, we discuss related previous work. We detail our technique in Section 3. Section 4 shows the simulation results and analysis. A design methodology for better using this technique in NoC design is described in Section 5. A brief conclusion is given in Section 6.

2. Previous work

In the near future, data and control signals will need multiple clock cycles to cross a chip using relatively slow global interconnects [4]. This fact will force designers to pipeline data on global interconnects to achieve higher throughput under a large delay. An optimistic view is that many data are quite large, and only the first set of bits will suffer the large delay.
As with other pipelines, the communication performance of wave-pipelined interconnects is related to the delay, the pipeline throughput, and the data size. We show the relationship in another paper [5]. Our study shows that wave pipeline is a favorable choice for communications on global interconnects.

Proceedings of the 11th Symposium on High Performance Interconnects (HOTI 03). Authorized licensed use limited to: Hong Kong University of Science and Technology. Downloaded on August 19, 2009 at 08:46 from IEEE Xplore. Restrictions apply.

Wave pipeline was introduced 34 years ago [6]. In wave pipeline, data are pipelined in circuits without using latches, because wires and transistors not only transmit data and compute but also store them for a period of time. Instead of using latches, wave pipeline stores data on wires and transistors on the fly. Usually wave pipeline design is very hard, because different paths from one input to one output of a circuit often have different delays, and all the delays must be balanced [7]. But the benefit is significant: the delay overhead of latches can be saved. An increasing number of practical designs have been reported in recent years [8] [9] [10]. RAMBUS memory is a successful design using wave pipeline.

We showed in our previous paper [5] that global interconnect is yet another right place to use wave pipeline. The relatively simple and regular circuit of an interconnect makes the difficulties of wave pipeline design much easier to handle. The benefit of using this advanced technique is great. If a global interconnect is pipelined using latches, those latches must be at least as large as the repeaters on the interconnect, because they need not only to store data but also to drive the next interconnect section, which needs very large inverters. In our case, these inverters are 20~50 times larger than the smallest inverters. A large inverter is slow and power-hungry. Needless to say, latches on a global interconnect will introduce a lot of delay, consume a lot of power, and use more area. Wave pipeline does not use latches, so it has less delay and uses much less power and area compared to traditional methods. Wave-pipelined interconnect also gives designers a new method and a larger space to trade off performance, area, and power. In the previous work, we showed the wave-pipelined interconnect itself without the senders and receivers.
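To make the dependence on delay, pipeline throughput, and data size concrete, the following minimal Python sketch uses a generic first-order pipeline model. The model is an assumption made for illustration and is not necessarily the formulation in [5]: the first bit on a line pays the full link delay, and each later bit arrives one pipeline period after the previous one. The function names are hypothetical, and the example numbers (1167ps delay, 290ps period) are borrowed from the up-link results in Section 4.1.

```python
def transfer_time_ps(n_bits: int, delay_ps: float, period_ps: float) -> float:
    """Time until the last of n_bits arrives on one pipelined line.

    Assumed first-order model: the first bit sees the full link delay,
    and every following bit arrives one pipeline period later.
    """
    if n_bits < 1:
        raise ValueError("n_bits must be at least 1")
    return delay_ps + (n_bits - 1) * period_ps


def effective_rate_gbps(n_bits: int, delay_ps: float, period_ps: float) -> float:
    """Average data rate of one line over the whole burst, in Gbps."""
    # 1 bit per picosecond equals 1000 Gbps.
    return n_bits / transfer_time_ps(n_bits, delay_ps, period_ps) * 1000.0


if __name__ == "__main__":
    delay_ps = 1167.0   # illustrative: up-link delay after the flip-flop
    period_ps = 290.0   # illustrative: one pipeline period
    for n_bits in (1, 8, 64, 512):
        rate = effective_rate_gbps(n_bits, delay_ps, period_ps)
        print(f"{n_bits:4d} bits per line: {rate:.2f} Gbps effective")
```

Under this model, only short bursts are dominated by the link delay; as the burst grows, the effective rate approaches the pipeline rate of one bit per period, which is the optimistic view described above.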
In this work, we show other features of a complete wave-pipelined interconnect design.

3. Design description

Our design achieves 3.45GHz on every single bit line in an old but widely used 0.25um technology, and we expect it to achieve much higher frequencies in new technologies. The maximum chip size is about 300mm² at 0.25um technology and 572mm² from 107nm to 65nm technologies [3]. So we choose a moderate length, 10mm, for our wave-pipelined interconnect design. The design uses TSMC 0.25um technology and is simulated using Cadence Spectre. We use metal-3 in a 5-metal-layer process. The delay-optimized 3.2um-pitch global interconnects are 1.2um wide based upon the research by Kahng [11], and we verified part of his results.

There are three components in our design (Figure 1). Data and clock are fed into the same drivers (or senders), and then they are pumped into clock and data lines. At the receiver end, rectified data signals are caught by flip-flops driven by amplified clocks. In the following, we describe each component in detail.

3.1. Clock and data lines

The bidirectional wave-pipelined interconnect is 16 bits wide. It could easily be changed to any other width, for example 128 bits, as long as a clock signal is transmitted along with a set of data lines. For the technology we use, one clock accompanies every eight data lines. The proportion is determined by the number of receivers a single clock signal can drive, and it may change in different technologies. In the 16-bit wave-pipelined interconnect, there are two lines for two reversed clock signals embedded with the data lines in each direction. The clocks are used by receivers to latch data. By sending clocks along with data, our design avoids using high-frequency phase-locked loops (PLL) or coding schemes to regenerate timing information from data. A direct result of this is that several clock cycles are saved. Clock lines use the same layout as data lines. This eases matching delays between data and clock lines in design and fabrication.
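The line counts implied by this arrangement can be worked out with a short Python sketch. It assumes that all lines of both directions sit side by side on one metal layer at the uniform 3.2um pitch and 1.2um width given above; the function names are hypothetical and the layout assumption is ours, not a statement of the actual floorplan.

```python
import math


def lines_per_direction(data_bits: int, data_per_clock: int = 8) -> int:
    """Data lines plus one clock line for every group of data lines."""
    clock_lines = math.ceil(data_bits / data_per_clock)
    return data_bits + clock_lines


def bundle_width_um(n_lines: int, pitch_um: float = 3.2, width_um: float = 1.2) -> float:
    """Edge-to-edge width of n_lines parallel wires at a uniform pitch."""
    return (n_lines - 1) * pitch_um + width_um


def footprint_mm2(n_lines: int, length_mm: float) -> float:
    """Rectangular footprint of the wire bundle (wires plus spacing)."""
    return bundle_width_um(n_lines) / 1000.0 * length_mm


if __name__ == "__main__":
    per_direction = lines_per_direction(16)   # 16 data lines + 2 clock lines
    total_lines = 2 * per_direction           # up-link and down-link
    print("lines per direction:", per_direction)
    print("total lines:", total_lines)
    print(f"bundle width: {bundle_width_um(total_lines):.1f} um")
    print(f"footprint over 10 mm: {footprint_mm2(total_lines, 10.0):.3f} mm^2")
```

With these assumptions the 16-bit bidirectional link has 36 lines, and the resulting footprint is close to the roughly 1.13mm² of metal reported in Section 4.3.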
Also, the environment will have equal effects on both data and clock lines. These characteristics are highly desirable in most designs. The clock circuit is the critical path of our design, because with the same signal strength, a clock drives about 40 times more load than a data signal. Sending two reversed clocks avoids the delay overheads of generating a reversed clock and amplifying clock signals to match the load.

(Figure 1 labels: clocks, data, driver, clock and data lines, amplifier, flip-flop, rectifier, receiver, data)
Figure 1. Wave-pipelined interconnect structure for one direction

Crosstalk is a major issue in global interconnect design. It causes delay variation among interconnects. We observed as much as 140ps delay variation on the 10mm interconnect between the fastest and slowest data signals. Crosstalk also distorts signals and so limits the pipeline frequency. Coupling capacitance and inductance among nearby lines create crosstalk. Coupling capacitance makes signals in nearby lines accelerate each other if they change in the same direction and hold each other back if they change in different directions. Coupling inductance always has the opposite effect to coupling capacitance.
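The capacitive part of this behavior is often approximated with a switching-factor (Miller) model, sketched below in Python. This is a textbook first-order approximation for a single neighbor, not the Spectre simulation used in this paper, and it ignores inductive coupling. The function name and the capacitance values are hypothetical placeholders.

```python
# Switching factor applied to the coupling capacitance, depending on
# what the neighboring line does while the victim line switches.
SWITCHING_FACTOR = {
    "same": 0.0,      # neighbor switches in the same direction: speeds up the victim
    "quiet": 1.0,     # neighbor holds its value: nominal load
    "opposite": 2.0,  # neighbor switches in the opposite direction: slows the victim
}


def effective_capacitance(c_ground: float, c_couple: float, neighbor: str) -> float:
    """Load seen by the victim line's driver under the assumed Miller model."""
    return c_ground + SWITCHING_FACTOR[neighbor] * c_couple


if __name__ == "__main__":
    c_ground = 1.0   # placeholder units, not measured values
    c_couple = 0.5   # placeholder units, not measured values
    for activity in ("same", "quiet", "opposite"):
        load = effective_capacitance(c_ground, c_couple, activity)
        print(f"neighbor {activity:8s}: effective load {load:.2f}")
```

Because wire delay grows with the load the driver sees, the spread between the "same" and "opposite" cases is one way to picture the delay variation between the fastest and slowest signals.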

While coupling capacitance reduces with the square of distance, coupling inductance reduces linearly with the distance. In our design, coupling capacitance is dominant, but we expect more coupling inductance effects in the next several technology generations.

We use interleaved lines and misaligned repeaters to reduce crosstalk delay variation from 140ps to 29ps (Figure 2). We interleave the data and clock lines for the two directions, up-link and down-link. Interleaving increases the distance among lines for each direction. In many cases, up-link and down-link are not active at the same time, and then lines from the idle direction serve as grounded shields for the lines from the active direction. When both directions are active at the same time, misaligned repeaters [11] reduce crosstalk delay among nearby lines. We place repeaters from different directions about 2.5mm from each other, so the reversed signal changes on the two sides of a repeater cancel each other's effects on nearby lines. For example, in Figure 2, assume the signal on line A0 is 0→1 in region 0 and the signal on line B0 is also 0→1 in region 0. Then in region 0, line A0 accelerates line B0. However, in region 1, the signal on line A0 is 1→0 and the signal on line B0 is still 0→1. Then in region 1, line A0 holds back line B0. In total, line A0 has two opposite effects on line B0, and they cancel each other.

(Figure 2 labels: lines A0, A1, B0, B1; Region 0, Region 1)
Figure 2. Interleaved lines

3.2. Sender and receiver

At the receiver end, data signals are rectified to sharpen the edges and smooth out noise before they can be used. Then the rectified data are caught by flip-flops in the receivers. It is difficult to design a flip-flop working at 3.45GHz in the 0.25um technology. Instead, we use a pair of flip-flops (Figure 3), which work at different clock edges. While one flip-flop catches a datum at rising clock edges, the other flip-flop catches the next datum at falling clock edges.

(Figure 3 labels: in, out 1, out 2)
Figure 3. A receiver with a pair of flip-flops

In a flip-flop, the feedback loop prevents data from dissipating, but it also increases the response time to incoming data. To increase the sensitivity and reduce the response time of a flip-flop, we simplify the first stage to a single inverter. To prevent data from being lost during a long idle period, we also send one more clock edge to transfer the datum stored in the first stage of a flip-flop to its second stage, where a feedback loop can keep it as long as needed.

The sender is a chain of inverters, which amplify and pump low-strength signals into clock and data lines. The interfaces of the wave-pipelined interconnect ease the connection between it and other parts of a chip. The input of a sender can be driven by a minimum-size inverter. The output data can be stored in a serial buffer and picked up using a local clock.

3.3. Signaling

The signaling in our design is simple. Apart from the data lines, there are only clock lines. Clocks also embed control information: they mark the beginning and the end of data. Clocks are sent only when data are sent, and there is no warm-up time for the clock. After the last datum, one more clock edge is sent to transfer the datum in the first stage of the flip-flops to the second stage. Up-link and down-link frequencies are not necessarily the same. In many designs, there are different performance requirements for up-links and down-links.

Clocks consume a lot of power during an active period, and our design reduces this power in two ways. First, the clock frequency is half of the pipeline frequency. Second, the clocks are sent along with data, and clocks are idle when no data need to be sent. Much power is saved by reducing the frequency and the active period.

4. Simulation results and analysis

We simulated the whole design in Cadence Spectre. The design works stably at 3.45GHz on every data line, and the total throughput is 55.2Gbps. The area is about 0.079mm². It needs 18.8pJ to transmit one bit, including the clock overhead.

4.1. Delay, throughput, and delay variation

We measured the 50% delay from the driver inputs (Table 1). Up-link and down-link have different delays because they have slightly different structures, but if necessary, they could be designed similarly. Signals fed into drivers are 290ps wide, including 20ps skews on the front and back ends.

Table 1. Delay and pipeline stages

Up-link/down-link       Before flip-flop   After flip-flop
50% delay (ps)          793/910            1167/1290
Pipeline stages         2.7/3.1            4.0/4.4
Delay variation (ps)    29                 --

The throughput on each data line is 3.45Gbps. For the up-link, data need 793ps to reach a flip-flop, and there are 2.7 pipeline stages. Because wave pipeline does not use latches to lock data, there can be a fraction of a pipeline stage. Flip-flops use another 374ps, or 1.3 pipeline stages, to catch and store data. The delay variation is less than 29ps, compared to 140ps without interleaving and misalignment.

4.2. Power

When active, the maximum power consumption of the 55.2Gbps up-link is 1039mW, and the energy efficiency is 18.8pJ/bit including clock overhead (Table 2). When idle, the leakage power is 0.43mW. A typical 10mm non-pipelined global interconnect working at 2.64GHz uses 20.5pJ/bit without considering receiver and clock [5]. Wave-pipelined interconnect works 1.3 times faster and is more energy efficient. Of the total power, clocks use 125mW, or 12%. This power would be wasted if clocks kept running while no data were being sent. We send clocks along with data, and instead of 125mW, only 0.34mW of leakage power is consumed when no data are sent. Repeaters use 34% of the total power, and drivers use 55% of the total, because large inverters are used to drive long interconnects.

4.3. Area

Based upon MOSIS lambda rules, the area of drivers including the clock circuit is about 0.067mm², and the area of receivers including the clock circuit is about 0.012mm². Compared to 2.7mm² for the ARM9E-S or 2.0mm² for the PowerPC405 [12] [13], the area is very small.
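The headline numbers in Sections 4.1 to 4.3 can be cross-checked with a few lines of arithmetic. The Python sketch below is only a consistency check on the reported values. It assumes that one pipeline stage corresponds to the 290ps signal width stated in Section 4.1, which is very close to the period of a 3.45GHz signal; the function names are hypothetical.

```python
PERIOD_PS = 290.0  # assumed duration of one pipeline stage (Section 4.1 signal width)


def pipeline_stages(delay_ps: float, period_ps: float = PERIOD_PS) -> float:
    """Number of data waves in flight over a path with the given delay."""
    return delay_ps / period_ps


def total_throughput_gbps(data_lines: int, rate_gbps_per_line: float) -> float:
    """Aggregate throughput of parallel data lines."""
    return data_lines * rate_gbps_per_line


def energy_per_bit_pj(power_mw: float, throughput_gbps: float) -> float:
    """mW / Gbps = (1e-3 J/s) / (1e9 bit/s) = 1e-12 J/bit = pJ/bit."""
    return power_mw / throughput_gbps


if __name__ == "__main__":
    throughput = total_throughput_gbps(16, 3.45)
    print(f"throughput: {throughput:.1f} Gbps")
    for label, delay in (("up-link before flip-flop", 793.0),
                         ("down-link before flip-flop", 910.0),
                         ("up-link after flip-flop", 1167.0),
                         ("down-link after flip-flop", 1290.0)):
        print(f"{label}: {pipeline_stages(delay):.1f} stages")
    print(f"energy: {energy_per_bit_pj(1039.0, throughput):.1f} pJ/bit")
    print(f"clock share of power: {125.0 / 1039.0:.0%}")
    print(f"driver + receiver area: {0.067 + 0.012:.3f} mm^2")
    print(f"speedup over 2.64GHz link: {3.45 / 2.64:.1f}x")
```

The fractional stage counts follow directly from dividing a delay by the stage duration, which is why a latch-free pipeline can hold a non-integer number of waves.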
The interconnect uses about 1.13mm² of metal. The area will be even less in newer technologies. Because wave-pipelined interconnect has a small area, it can be used to give data-hungry IP cores enough communication bandwidth.

4.4. Analysis

Wave-pipelined interconnect can be designed at various lengths and bit widths other than 10mm and 16 bits. It fits well into the globally asynchronous and locally synchronous scenarios when more than one billion transistors are on a single chip. Wave-pipelined interconnect can be used to communicate between asynchronous clock zones. Because sender clock signals are transmitted along with data, receivers do not need to use a local clock to catch data. After data are caught, they can be synchronized to the local clock by storing them in buffers. The buffers are written using sender clocks and read using the local clock.

The high working frequencies of wave-pipelined interconnect easily match the clocks of IP cores. Because it can run at the same clock as local IP cores, there is no need to generate an extra clock for wave-pipelined interconnect, and this also simplifies the timing issues on the interfaces.

Table 2. Maximum power and energy efficiency

                                 Power (mW)   Energy efficiency (pJ/bit)
Data    Driver                   571          10.3
        Receiver                 30           0.543
        Interconnect repeater    313          5.67
        Data total               914          16.6
Clock   Driver                   72           1.30
        Amplifier                14           0.254
        Interconnect repeater    39           0.707
        Clock total              125          2.26
Total                            1039         18.8

Wave-pipelined interconnect can also be extended to buses, where multiple IP cores share the high-bandwidth global interconnects. There will be different numbers of wave pipeline stages between different IP cores, but the design is still simple because there is no need to balance the delay between each pair of IP cores, as there is when designing a pipelined bus using latches.

5. Design methodology

Structured communication links are the foundation of a design using NoC. Wave-pipelined interconnect is a technique to support structured links. Wave-pipelined interconnect can be integrated into a design flow easily (Figure 4). After an architecture and a NoC for it are chosen, designers can identify the communication links and their throughputs and maximum tolerable delays. The links that have long delays and high throughputs could be implemented using wave-pipelined interconnects. The maximum delays limit the maximum length of interconnects. Based upon the throughputs and delays, wave-pipelined interconnect can be designed using the method mentioned in our previous paper [5]. Designers can get the area and power estimations of the wave-pipelined interconnects and use them to estimate the performance, area, and power of the architecture.

(Figure 4 flow: Choose architecture and NoC; Identify communication links; Design wave-pipelined interconnects; Estimate performance, area, and power)
Figure 4. Design methodology

Wave-pipelined interconnect helps to overcome some difficulties of NoC design. Because wave-pipelined interconnect has a simple structure, it is easy to automate the design and even build a library of IP cores for wave-pipelined interconnects with various lengths and performance. Wave-pipelined interconnect also gives more room in performance, power, and area along with other useful features, and designs that use it are more flexible.

6. Conclusion

Supporting NoC design, wave-pipelined interconnect has a simple yet efficient structure. Wave-pipelined interconnect has high performance while using less area and being more energy efficient. It has several major advantages over pipelines using latches. It is compatible with the high clock frequencies of IP cores and able to give enough bandwidth to data-hungry IP cores. It fits well in globally asynchronous and locally synchronous clock schemes. Wave-pipelined interconnect design can be integrated into a design flow and shows designers a larger and more flexible design space.

7. Acknowledgement

This work was supported in part by the New Jersey Commission on Science and Technology (NJCST).

8. References

[1] William J. Dally, Brian Towles, Route packets, not wires: on-chip interconnection networks, Proceedings of the 38th Design Automation Conference, 2001.
[2] T. Meincke, A. Hemani, et al., Globally asynchronous locally synchronous architecture for large high-performance ASICs, IEEE International Symposium on Circuits and Systems VLSI, 1999.
[3] Sematech, International Technology Roadmap for Semiconductors, 2001 Edition.
[4] Jason Cong, An interconnect-centric design flow for nanometer technologies, Proceedings of the IEEE, 89(4): 505-528, 2001.
[5] Jiang Xu, Wayne Wolf, Wave pipelining for application-specific networks-on-chip, International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, Grenoble, 2002.
[6] L. Cotton, Maximum rate pipelined systems, Proceedings of AFIPS Spring Joint Computer Conference, 1969.
[7] Wayne P. Burleson, Maciej Ciesielski, Fabian Klass, Wentai Liu, Wave-pipelining: A tutorial and research survey, IEEE Transactions on VLSI Systems, 6(3): 464-474, 1998.
[8] O. Hauck, A. Katoch, S.A. Huss, VLSI system design using asynchronous wave pipelines: a 0.35um CMOS 1.5 GHz elliptic curve public key cryptosystem chip, International Symposium on Asynchronous Circuits and Systems, 188-197, 2000.
[9] Byoung-Hoon Lim, Jin-Ku Kang, A self-timed wave pipelined adder using data align method, Asia-Pacific Conference on ASICs, 77-80, 2000.
[10] D. Wong, G. De Micheli, M. Flynn, Designing high performance digital circuits using wave pipelining: Algorithms and practical experiences, IEEE Transactions on CAD, 12(1): 25-46, Jan. 1993.
[11] Andrew B. Kahng, Sudhakar Muddu, Egino Sarto, Rahul Sharma, Interconnect Tuning Strategies for High-Performance ICs, Design, Automation, and Test in Europe, 1998.
[12] ARM9E Family Flyer, www.arm.com.
[13] PowerPC 405 Core White Paper, www.ibm.com.