Optical clock distribution for a more efficient use of DRAMs

D. Litaize, M.P.Y. Desmulliez*, J. Collet**, P. Foulk*
Institut de Recherche en Informatique de Toulouse (IRIT), Universite Paul Sabatier, 31062 Toulouse, France. Email: litaize@irit.fr
* Department of Computing & Electrical Engineering, Heriot-Watt University, Edinburgh EH14 4AS, UK
** Laboratoire d'Analyse et d'Architecture des Systemes (LAAS), CNRS, 31077 Toulouse, France

I. Motivation

Interconnect has become increasingly important for the electronics community since memory access times have not kept pace with the increase in speed of the central processor. The gap in relative performance between processors and DRAM memories has grown unabated by 50% per year over the last five years. Although cache memory can partially hide the main memory access time, the main bottleneck at the memory level remains unaffected. DRAM is here to stay as the main component for high-capacity memory, but its cycle time is not likely to decrease by much owing to the very nature of the storage process (capacitive charge/discharge). The parallelisation of DRAM chips has increased the aggregate bandwidth, but the latency to the first word returned by the memory has decreased only slowly. A common solution, chosen by all memory manufacturers and computer designers, has been to widen the bus between the core processor and the (cache) memory in order to satisfy the bandwidth demanded by the processor [1]. This strategy might not be the most efficient one. To illustrate this statement, consider a block transfer of 512 bits, say, which corresponds to the transfer of a data block between a core processor and its cache memory, and suppose that a hypothetical new technology allows us to transfer data at a high bit rate such as 0.5 or 10 Gbit/s per line (the two rates considered in Table 1). In the case of a cache, the address of the requested block would have to be submitted to bus arbitration before being sent to memory.
Table 1 shows this typical case-study for a bus arbitration time of 10 ns, an address of N = 40 bits to be transferred to the memory, a memory read (and write) time of RW = 40 ns, and a data block of M = 512 bits.

  F (GHz)  S (bits)  Bandwidth F*S  T = N/(F*S)  D = M/(F*S)  L = T+D+RW  Efficiency  Block transfer rate
                     (Gbyte/s)      (ns)         (ns)         (ns)        D/L         N/(D+T) (ns^-1)
  0.5      1         0.0625         80           1024         1144        0.89        0.04
  0.5      8         0.5            10           128          178         0.72        0.29
  0.5      64        4              1.25         16           57.25       0.28        2.3
  10       1         1.25           4            51.2         95.2        0.54        0.72
  10       8         10             0.5          6.4          46.9        0.14        5.8
  10       64        80             0.06         0.8          40.86       0.019       46.5

Table 1: Increasing the bus width decreases the usage efficiency of the bus. (F = clock frequency, S = data bus size, T = address transfer time, D = data transfer time, L = latency.)
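The entries of Table 1 follow from three elementary formulas. The sketch below reproduces them in plain Python; the two line rates (0.5 and 10 Gbit/s per wire) are those implied by the table's bandwidth column.

```python
# Reproduce the figures of Table 1: latency and bus efficiency for a
# 512-bit block transfer. Values from the text: N = 40-bit address,
# RW = 40 ns memory read/write time (bus arbitration excluded, as in
# the table itself).
N, M, RW = 40, 512, 40   # address bits, block bits, memory time (ns)

def link(freq_ghz, width_bits):
    rate = freq_ghz * width_bits   # Gbit/s, i.e. bits per ns
    T = N / rate                   # address transfer time (ns)
    D = M / rate                   # data transfer time (ns)
    L = T + D + RW                 # latency (ns)
    return T, D, L, D / L          # last element: bus efficiency

for f, s in [(0.5, 1), (0.5, 8), (0.5, 64), (10, 1), (10, 8), (10, 64)]:
    T, D, L, eff = link(f, s)
    print(f"{f:>4} GHz x {s:>2} bit(s): L = {L:7.2f} ns, efficiency = {eff:.3f}")
```

Running the loop prints the latency and efficiency columns of Table 1, confirming that widening the bus shrinks the latency while the efficiency collapses.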
The two most important parameters to consider are the latency and the bandwidth. If latency alone were taken into account, the last line of Table 1 would offer the best solution, at the expense of a low bus efficiency (around 2%). This efficiency can be increased by an out-of-order execution strategy which partly masks waiting times at the cache. About 50 transfers would then be needed to saturate the bus, which is not realistic for two reasons: (1) the maximum number of non-blocking caches is of the order of two to four and is increasing only slowly, and (2) 50 memory banks with uniform and distributed access would be needed. Presently, DRAM chips have a capacity of 64 Mbits with a data bus width of 16 bits. In order to obtain a transfer burst of 512 bits in 4 beats, one would need a memory width of 128 bits, that is 8 chips, or 64 Mbytes per bank. The total memory capacity provided is beyond what is needed. A data bus width of 8 bits offers a better compromise: fewer than ten accesses saturate the bus, the latency is increased by only 15% (47 ns against 41 ns), and the memory size needed is adequate while providing the bandwidth required to saturate the bus. This solution is, however, not the best, for the following reason. A cache miss triggers a direct block transfer from the memory to the processor (the cache being bypassed), with the requested word at the beginning. In the case of an 8-bit-wide bus running at 10 GHz, blocks would be transferred one after the other, and the processor would have to wait for the complete transfer of one block before accessing the first word of the next block. In the case of eight 1-bit-wide busses, the transfer of a single block would take 8 times as long, but blocks would be transferred in parallel, allowing simultaneous access to the first word of each block. This solution could prove useful for modern microprocessors.
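The trade-off between the two organisations can be made concrete with a small calculation. The sketch below assumes, purely for illustration, a 64-bit first word and 8 outstanding 512-bit blocks; these two parameters are our own, not from the text.

```python
# Time to reach the first word of each outstanding block, for
# (a) one 8-bit-wide bus at 10 GHz, blocks sent back to back, versus
# (b) eight 1-bit busses at 10 GHz, one block per wire in parallel.
# WORD = 64 bits and K = 8 blocks are illustrative assumptions.
F = 10                   # line rate per wire, Gbit/s (= bits per ns)
M, WORD, K = 512, 64, 8  # block size, first-word size, block count

def first_word_wide(k):
    """First word of block k on the single 8-bit bus: wait for the
    k preceding blocks, then for the first word of block k."""
    return k * (M / (8 * F)) + WORD / (8 * F)

def first_word_parallel(k):
    """Eight 1-bit busses: every block streams at once, so the first
    word of every block arrives at the same (later) time."""
    return WORD / F

print([round(first_word_wide(k), 1) for k in range(K)])
print([round(first_word_parallel(k), 1) for k in range(K)])
```

The wide bus delivers the first word of the first block sooner (0.8 ns), but the parallel organisation delivers the first word of all eight blocks after 6.4 ns, long before the wide bus has finished even its second block.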
The point-to-point high-speed serial bus seems to offer the best use of the memory, as well as of any functional units which could be connected to it. Multiple serial busses connecting functional units would distribute the load in a balanced manner. The number of serial busses connected to any given unit would be chosen such that the aggregate bandwidth is maintained. The latency would be increased, but this can nowadays be tolerated by modern processors.

II. The M3S architecture

Taking advantage of the benefits of high-speed serial links, a project called M3S (Multiprocessor with Serial Multiport memory) was implemented by one of us (D. Litaize). The project aimed at designing a multiprocessor system with serial links to multiport memory modules. Part of the system was built using only electronic components rather than optoelectronics. A schematic of the system architecture is shown in figure 1.

[Figure 1: The M3S architecture (schematic) - M memory modules interconnected with N processor modules through high-speed serial links.]

The interconnection network is based on a set of high-speed serial links between M memory modules and N processor modules. A processor module is defined as a set of processors which benefit from the
available network bandwidth. The memory module, on the other hand, is serially multiported in order to meet the network bandwidth requirements. A detailed description of the system can be found in [2,3]. In our first implementation, we chose to construct a synchronous machine with a purely electronic distribution of the clock. A 1 GHz central clock was sent to all the modules by a tree of drivers. The cumulative effect of the skew of each driver (around 50 ps) along the tree prevented the construction of a large fan-out system. In addition, the length of each coaxial cable had to be an exact multiple of the distance travelled by the clock signal during one clock period. A second demonstrator used clock-recovery circuits [4], leading to an asynchronous approach: each unit is connected to its own independent clock. This kind of circuit integrates serial-parallel conversion. For inter-board interconnections, this type of circuit appears unnecessarily complex (and expensive) and seems better suited to long-distance communications. Indeed, with this circuit, the data signal needs to be encoded using, for example, the 8B/10B encoding scheme, which decreases the useful bandwidth by 20%. It also makes the interface complex at both ends and is wasteful in silicon real estate, since components need to be fabricated which slice the information byte-wise for transmission, search a look-up table (taking the previous byte into account to avoid baseline wander), and finally serialise the signal. By breaking the seamless data flow with the electronic functions described above, this scheme seems of limited use for high-throughput communication links. The distribution of a high-quality clock signal from a central point to all modules, irrespective of the connection links, is a solution which combines the advantages of both approaches.
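The scaling problem of the first demonstrator can be illustrated with a back-of-the-envelope model. The 50 ps per-driver skew figure is from the text; the assumption that the tree is binary (so fan-out N needs about log2(N) driver stages) is ours.

```python
# Rough model of why the electronic clock tree did not scale: each
# driver stage contributes ~50 ps of skew (figure from the text).
# Assumption (ours): a binary tree, so fan-out N needs ceil(log2(N))
# cascaded driver stages, and the skews add up along the path.
import math

SKEW_PER_DRIVER_PS = 50
PERIOD_PS = 1000          # 1 GHz central clock

def worst_case_skew(fan_out):
    stages = math.ceil(math.log2(fan_out))
    return stages * SKEW_PER_DRIVER_PS

for n in (4, 16, 64, 1024):
    s = worst_case_skew(n)
    print(f"fan-out {n:>4}: ~{s} ps skew "
          f"({100 * s / PERIOD_PS:.0f}% of the clock period)")
```

Even at a modest fan-out of 16, the accumulated skew already eats a fifth of the 1 ns clock period, which is consistent with the observation that a large fan-out system could not be built this way.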
Optical clock distribution allows the provision of such a signal, and a simple phase-recovery circuit (the period being provided by the uniqueness of the clock source) can be implemented to overcome the drawbacks of the second approach. The whole bandwidth is employed usefully at the interface since no encoding scheme is necessary. The transmission of a packet can be carried out by a shift register without change of pace, and high data frequencies can be implemented by carefully designed MOS-based logic. Figure 2 shows how it is possible to concentrate high-speed logic only in the multiplexors and demultiplexors: the shift registers work at a quarter (in this example) of the transmission clock frequency. Counters (not shown) are assumed to be included in the MUX and DMUX logic to command the multiplexor and the demultiplexor, which act as data accelerator and decelerator, respectively. The lengths of the connections d1, d2 and d3 are short but arbitrary.

[Figure 2: Multiplexing and demultiplexing data over 4 shift registers. The central clock feeds a phase-recovery block and a divided (/4) shift clock; four shift registers feed a MUX on the transmit side, and a DMUX fills four shift registers on the receive side, the packet being framed by a start bit S.]

III. Optical clock distribution

The benefits of a serial link have been shown to be appealing in the multi-GHz region. At this frequency of operation, clock distribution and data transfer would need to be performed with a perfect and stable clock, a task still very complex to achieve with electronic components. Optical clock distribution has been known to provide low jitter over long distances and stable isochronous regions at the chip and multichip level [5]. With this scheme, a central ideal clock (a modulated laser source) is distributed everywhere in
the system, enabling, in theory, the same clock shape and frequency at every module. Using this scheme, data would be transferred in NRZ mode so that all the bandwidth of the link (medium and interfaces) is devoted to useful work. A start bit is used to put the reception clock in phase with the clock source. The main engineering problem lies in controlling the phase. Figure 3 gives a conceptual logical schematic of a possible circuit which could be used to detect the phase in one clock cycle.

[Figure 3: Phase clock detect - a chain of flip-flops clocked by copies of the central clock (period T) shifted by successive T/3 delays, monitoring the idle data link for the start bit S.]

All flip-flops are reset to an initial state. As long as the data link is idle, the flip-flops load a zero value. The start bit S triggers one flip-flop, which locks the others in a circular way (all clock inputs are shifted by a constant delay, one third of a period in the example). The flip-flop which has switched enables the right clock phase for the shift register. The design must be done very carefully, as the time available to lock the flip-flops has to be shorter than the delay T/3. This design has not so far been validated. The viability of the optical clock distribution option, beyond the feasibility analysis of the manufacturing process, can be ascertained only once an optical power budget and a noise and jitter analysis have been carried out. In our configuration, clock distribution onto 16 photodetectors is envisaged using an optical waveguide H-tree structure (the optical backplane) whose branches are formed from polymer waveguides. A laser source is coupled onto the optical backplane using standard laser-to-fibre connections. The fan-out of the optical beam from the laser source is carried out by 3 dB directional couplers, and the resulting beams are output using grating couplers built onto the same substrate as the waveguides.
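The phase-detection idea of figure 3 can be sketched behaviourally. This is a model of the intended behaviour, not of the circuit: the T/3 delays come from the figure, while the rule that the first rising edge after the start bit wins is our modelling assumption for the flip-flop chain.

```python
# Behavioural sketch of the phase-detection idea of figure 3:
# three copies of the central clock, shifted by successive T/3 delays;
# the start bit selects one of them to drive the shift register.
import math

T = 1.0                            # clock period (e.g. 1 ns at 1 GHz)
PHASES = [0.0, T / 3, 2 * T / 3]   # constant T/3 delays, as in the figure

def next_rising_edge(offset, t):
    """Time of the first rising edge of this phase at or after time t."""
    n = math.ceil((t - offset) / T)
    return offset + n * T

def select_phase(t_start):
    """Index of the phase latched by the start bit. Modelling
    assumption: the phase whose edge arrives first after the start
    bit wins, mimicking the flip-flop that locks out the others."""
    return min((next_rising_edge(p, t_start), i)
               for i, p in enumerate(PHASES))[1]
```

With three phases the selected edge is never more than T/3 away from the start bit, which matches the requirement stated above that the flip-flops must lock in less than T/3.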
Assuming a refractive index of 1.64 for the waveguide, the period of the grating is about 1 micron for a saturation depth of 0.8 micron. The optical power budget indicates a total incident power on each of the detectors of the order of -22.4 dBm for a laser source power of 10 dBm. The main losses arise from the design of the grating coupler and the intrinsic loss of the fan-out. The sources of clock skew and jitter (detectors, waveguides and transmitters) have been estimated theoretically at about 45 ps in the case of the FET photoreceiver.

IV. Conclusion

High-speed serial links offer the best usage of the memory bus, given the still large access time of the memory. The reduction of the bus width in itself allows a cost reduction of the backplane. We propose an optical implementation which relies on the distribution of the clock to the processors and memories in order to control the phase of the signals.
Beyond the feasibility study of such a scheme, which will be explained during the presentation, it should be noted that its implementation depends on conflicting economic interests. On the one hand, processor chip manufacturers are interested in high performance and architectural innovation. On the other hand, memory chip manufacturers are interested only in high volume and high capacity; this situation tends to encourage evolutionary or incremental technological steps, as recently witnessed by the new DRAM architectures proposed by the electronics community [6]. A radical change in memory architecture would best be implemented if processor manufacturers were to establish alliances with high-performance memory chipmakers.

Bibliography
[1] Y. Katayama, Trends in semiconductor memories, IEEE MICRO, Nov/Dec 97, pp. 10-17.
[2] D. Litaize, A. Mzoughi, P. Sainrat, C. Rochange, Towards a shared memory massively parallel multiprocessor, 19th Annual International Symposium on Computer Architecture (ISCA), Brisbane, Australia, 19-22 May 92; also ACM SIGARCH Computer Architecture News, Vol. 20, No. 2, pp. 70-79.
[3] D. Litaize, A. Mzoughi, P. Sainrat, C. Rochange, The design of the M3S project: a Multiported Shared Multiprocessor, Supercomputing 92, Minneapolis, USA, 16-20 Nov 92, pp. 326-335.
[4] Hewlett-Packard HDMP-1002 transmitter and HDMP-1004 receiver.
[5] M.P.Y. Desmulliez, M.R. Taghizadeh, P.W. Foulk, Optical clock distribution for multichip modules, Optical Review, pp. 113-114 (1996).
[6] D. Bursky, Advanced DRAM architectures overcome data bandwidth limits, Electronic Design, November 17, pp. 73-88 (1997).