Optical clock distribution for a more efficient use of DRAMs

D. Litaize, M.P.Y. Desmulliez*, J. Collet**, P. Foulk*
Institut de Recherche en Informatique de Toulouse (IRIT), Universite Paul Sabatier, 31062 Toulouse, FRANCE. Email: litaize@irit.fr
* Department of Computing & Electrical Engineering, Heriot-Watt University, Edinburgh EH14 4AS, UK
** Laboratoire d'Analyse et d'Architecture des Systemes (LAAS), CNRS, 31077 Toulouse, FRANCE

I. Motivation

The processor-memory interconnect has become increasingly important to the electronics community because memory access times have not kept pace with the increase in speed of the central processor. The gap in relative performance between processors and DRAM memories has grown by about 50% per year over the last five years. Although cache memory can partially hide the main memory access time, the bottleneck at the memory level itself remains. DRAM will remain the main component for high-capacity memory, but its cycle time is unlikely to decrease much because of the very nature of the storage process (capacitive charge/discharge). The parallelisation of DRAM chips has increased the aggregate bandwidth, but the latency before the first word reaches the processor has decreased only slowly.

A common solution chosen by memory manufacturers and computer designers has been to widen the bus between the core processor and the (cache) memory in order to satisfy the bandwidth demanded by the processor [1]. This strategy might not be the most efficient one. To illustrate this statement, consider a block transfer of 512 bits, say, which corresponds to the transfer of a data block between a core processor and its cache memory, and a hypothetical new technology that allows data to be transferred at a bit rate of 0.5 or 10 Gbit/s per wire, for example (the two frequencies used in Table 1). On a cache miss, the address of the requested block would have to be submitted to bus arbitration before being sent to memory. Table 1 shows this typical case study for a bus arbitration time of 10 ns, an address of N = 40 bits to be transferred to the memory, a memory read (and write) time of RW = 40 ns, and a data block transfer of M = 512 bits.

                                              F = 0.5 GHz                  F = 10 GHz
  Data bus size, S                       1 bit   8 bits  64 bits    1 bit   8 bits  64 bits
  Bandwidth (Gbytes/s)       F*S         0.0625  0.5      4         1.25    10      80
  Address transfer time (ns) T=N/(F*S)   80      10       1.25      4       0.5     0.06
  Data transfer time (ns)    D=M/(F*S)   1024    128      16        51.2    6.4     0.8
  Latency time (ns)          L=T+D+RW    1144    178      57.25     95.2    46.9    40.86
  Efficiency                 D/L         0.89    0.72     0.28      0.54    0.14    0.019
  Block transfer rate (ns^-1) N/(D+T)    0.04    0.29     2.3       0.72    5.8     46.5

Table 1: Increasing the bus width decreases the usage efficiency of the bus.
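For readers who want to check the arithmetic, the short Python sketch below recomputes the entries of Table 1 from the parameters stated above (N = 40 address bits, M = 512 data bits, RW = 40 ns; the 10 ns arbitration time is not included in L, matching the table). The function name and output layout are ours, purely illustrative.

```python
# Recompute the Table 1 figures: latency and efficiency of a block transfer
# as a function of the per-wire frequency F and the bus width S.

N, M, RW = 40, 512, 40.0                # address bits, block bits, memory access time (ns)

def bus_metrics(f_ghz, width_bits):
    rate = f_ghz * width_bits           # Gbit/s, i.e. bits per ns
    T = N / rate                        # address transfer time (ns)
    D = M / rate                        # data transfer time (ns)
    L = T + D + RW                      # latency of one block (ns)
    return rate / 8, T, D, L, D / L     # bandwidth in Gbytes/s, ..., bus efficiency

for f in (0.5, 10.0):
    for s in (1, 8, 64):
        bw, T, D, L, eff = bus_metrics(f, s)
        print(f"F={f:>4} GHz  S={s:>2} bits  BW={bw:7.4f} GB/s  "
              f"T={T:6.2f} ns  D={D:7.2f} ns  L={L:7.2f} ns  eff={eff:5.3f}")
```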

The two most important parameters to consider are the latency and the bandwidth. If latency alone were taken into account, the last column of Table 1 (the 64-bit bus at 10 GHz) would offer the best solution, at the expense of a low bus efficiency (around 2%). This efficiency can be increased by an out-of-order execution strategy which partly masks waiting times at the cache. About 50 transfers would then be needed to saturate the bus, which is not realistic for two reasons: (1) the maximum number of outstanding accesses allowed by non-blocking caches is of the order of two to four and is increasing only slowly, and (2) 50 memory banks with uniform and distributed access would be needed.

Presently, DRAM chips have a capacity of 64 Mbits with a data bus width of 16 bits. In order to obtain a transfer burst of 512 bits in 4 beats, one would need a memory width of 128 bits, that is 8 chips, or 64 Mbytes per bank. The total memory capacity provided is beyond what is needed. A data bus width of 8 bits offers a better compromise: fewer than ten accesses saturate the bus, the latency time is increased by only 15% (47 ns against 41 ns), and the memory size needed is adequate while providing the bandwidth required to saturate the bus. This solution is however not the best, for the following reason. A cache miss provokes a direct block transfer from the memory to the processor (the cache being bypassed) with the requested word at the beginning. In the case of an 8-bit wide bus running at 10 GHz, blocks would be transferred serially and the processor would have to wait for the complete transfer of one block before accessing the first word of the next block. In the case of 8 busses of 1 bit each, the block transfer time would be 8 times as long, but blocks would be transferred in parallel, allowing simultaneous access to the first word of each block. This solution could prove useful for modern microprocessors.

The point-to-point high-speed serial bus therefore seems to offer the best use of the memory, as well as of any functional units which could be connected to it. Multiple serial busses connecting functional units would distribute the load in a balanced manner. The number of serial busses connected to any given unit would be chosen so that the aggregate bandwidth is maintained. The latency time would be increased, but this can today be tolerated by modern processors. A quick numerical check of the figures quoted in this section is sketched below.
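A minimal sketch of that check, in Python, using only the figures already quoted (64 Mbit x16 DRAM chips, a 512-bit burst in 4 beats, and the Table 1 parameters). Reading the last row of Table 1 as the number of overlapped transfers needed to hide the 40 ns access time behind the bus is our interpretation; all names are illustrative.

```python
# Bank sizing for a 512-bit burst delivered in 4 beats from 64 Mbit x16 DRAM chips.
chip_bits  = 64 * 2**20                   # 64 Mbit per chip
chip_width = 16                           # 16-bit data bus per chip
burst_bits, beats = 512, 4

bank_width = burst_bits // beats          # 128 bits transferred per beat
chips      = bank_width // chip_width     # 8 chips per bank
bank_MB    = chips * chip_bits // 8 // 2**20
print(bank_width, chips, bank_MB)         # -> 128 8 64

# Overlapped transfers needed to hide RW = 40 ns behind the bus, roughly RW / (T + D),
# for the 8-bit and the 64-bit bus at 10 GHz.
RW, N, M = 40.0, 40, 512
for rate_bits_per_ns in (80.0, 640.0):
    T, D = N / rate_bits_per_ns, M / rate_bits_per_ns
    print(round(RW / (T + D), 1))         # -> 5.8, then 46.4 (about 50)
```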

II. The M3S architecture

Taking advantage of the benefits of high-speed serial links, a project called M3S (Multiprocessor with Serial Multiport Memory) was implemented by one of us (D. Litaize). The project aimed at designing a multiprocessor system with serial links to multiport memory modules. Part of the system was built using only electronic components rather than optoelectronics. A schematic of the system architecture is shown in figure 1.

Figure 1: The M3S architecture (schematic), connecting M memory modules to N processor modules.

The interconnection network is based on a set of high-speed serial links between M memory modules and N processor modules. A processor module is defined as a set of processors which share the available network bandwidth. The memory module, on the other hand, is serially multiported in order to meet the network bandwidth requirements. A detailed description of the system can be found in [2,3].

In our first implementation, we chose to construct a synchronous machine with a purely electronic distribution of the clock. A 1 GHz central clock was sent to all the modules by a tree of drivers. The cumulative effect of the skew of each driver (around 50 ps) along the tree prevented the construction of a system with large fan-out. In addition, the length of each coaxial cable had to be an exact multiple of the distance travelled by the clock signal during a clock period.

A second demonstrator used clock recovery circuits [4], leading to an asynchronous approach: each unit is connected to its own independent clock. This kind of circuit integrates serial-to-parallel conversion. For inter-board interconnections, this type of circuit appears unnecessarily complex (and expensive) and seems better suited to long-distance communications. Indeed, with this circuit the data signal needs to be encoded, for example with the 8B/10B encoding scheme, which decreases the useful bandwidth by 20%. It also makes the interface complex at both ends and is wasteful in silicon real estate, since components must be fabricated which slice the information byte-wise for transmission, search a look-up table (taking the previous byte into account to avoid baseline wander), and finally serialise the signal. Because these electronic functions break an otherwise seamless data flow, the scheme seems of limited use for high-throughput communication links.

The distribution of a high-quality clock signal from a central point to all modules, irrespective of the connection links, is a solution which combines the advantages of both approaches. Optical clock distribution allows the provision of such a signal, and a simple phase recovery circuit (the period being fixed by the uniqueness of the clock source) can be implemented to overcome the drawbacks of the second approach. The whole bandwidth is employed usefully at the interface since no encoding scheme is necessary. The transmission of a packet can be carried out by a shift register without change of pace, and high data frequencies can be achieved with carefully designed MOS logic. Figure 2 shows how the high-speed logic can be concentrated in the multiplexer and demultiplexer alone: the shift registers work at a quarter (in this example) of the transmission clock frequency. Counters (not shown) are assumed to be included in the MUX and DMUX logic to drive the multiplexer and the demultiplexer, which act as data accelerator and decelerator respectively. The lengths of connections d1, d2 and d3 are short but arbitrary.

Figure 2: Multiplexing and demultiplexing data over four shift registers. The central clock is divided by 4 to produce the shift clock, a phase-recovery block aligns it, and a start bit S marks the beginning of a packet.
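The idea of figure 2 can be summarised with a short behavioural sketch. The Python below only illustrates the interleaving performed by the MUX and DMUX (four slow shift registers versus one fast serial link); it is not the authors' implementation, and all names are ours.

```python
# Behavioural sketch of figure 2: four shift registers clocked at f/4 feed a
# multiplexer that emits one bit per fast-clock cycle, so only the MUX/DMUX
# ever sees the full serial-link rate.

def serialize(words, lanes=4):
    """Interleave the bits of `lanes` slow shift registers onto one fast link."""
    assert len(words) == lanes and len({len(w) for w in words}) == 1
    stream = []
    for t in range(len(words[0])):      # one slow-clock period ...
        for lane in range(lanes):       # ... covers `lanes` fast-clock beats
            stream.append(words[lane][t])
    return "".join(stream)

def deserialize(stream, lanes=4):
    """Inverse operation performed by the DMUX on the receiving side."""
    words = ["" for _ in range(lanes)]
    for i, bit in enumerate(stream):
        words[i % lanes] += bit
    return words

if __name__ == "__main__":
    tx = ["1010", "1100", "0011", "0110"]   # contents of the four registers
    link = serialize(tx)                     # bit stream on the serial link at f
    assert deserialize(link) == tx           # recovered at f/4 on the far side
    print(link)
```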

III. Optical clock distribution

The benefits of a serial link have been shown to be appealing in the multi-GHz region. At such operating frequencies, clock distribution and data transfer would need to be performed with a perfect and stable clock, a task still very complex to achieve with electronic components. Optical clock distribution has been known to provide low jitter over long distances and stable isochronous regions at the chip and multichip level [5]. With this scheme, a central ideal clock (a modulated laser source) is distributed everywhere in the system, providing, in theory, the same clock shape and frequency at every point.

Using this scheme, data would be transferred in NRZ mode so that the whole bandwidth of the link (medium and interfaces) is devoted to useful work. A start bit is used to put the reception clock in phase with the clock source. The main engineering problem lies in controlling the phase. Figure 3 gives a conceptual logic schematic of a circuit which could be used to detect the phase within one clock cycle.

Figure 3: Phase clock detection. The central clock of period T feeds a chain of flip-flops through successive T/3 delays; a start bit S on the otherwise idle data link selects the correct phase of the shift clock for the shift register.

All flip-flops are reset to an initial state. As long as the data link is idle, the flip-flops load a zero value. The start bit S triggers one flip-flop, which locks the others in a circular way (all clock inputs are shifted by a constant delay, one third of a period in this example). The triggered flip-flop enables the right clock phase for the shift register. The design must be handled very carefully, as the time available to lock the flip-flops has to be smaller than the delay T/3. This design has not yet been validated.

The viability of the optical clock distribution option, beyond the feasibility analysis of the manufacturing process, can only be ascertained after an optical power budget and a noise and jitter analysis have been carried out. In our configuration, clock distribution onto 16 photodetectors is envisaged using an optical waveguide H-tree structure (the optical backplane) whose branches are formed from polymer waveguides. A laser source is coupled onto the optical backplane using standard laser-to-fibre connections. The fan-out of the optical beam from the laser source is carried out by 3 dB directional couplers, and the resulting beams are output using grating couplers built onto the same substrate as the waveguides. Assuming a refractive index of 1.64 for the waveguide, the period of the grating is about 1 micron for a saturation depth of 0.8 micron. The optical power budget indicates a total incident power on each detector of the order of -22.4 dBm for a laser source power of 10 dBm. The main losses arise from the design of the grating coupler and the intrinsic loss of the fan-out. The sources of clock skew and jitter (detectors, waveguides and transmitters) have been estimated theoretically at about 45 ps in the case of the FET photoreceiver.
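The power budget quoted above can be sanity-checked with a few lines of Python. Only the 10 dBm source power, the 16-way fan-out and the -22.4 dBm per-detector figure come from the text; attributing the remainder of the loss to the grating couplers, the laser-to-fibre coupling and excess losses is our assumption.

```python
import math

# Figures quoted in the text.
P_source_dBm   = 10.0      # laser source power
n_detectors    = 16        # photodetectors fed by the H-tree
P_detector_dBm = -22.4     # incident power on each detector

total_loss_dB  = P_source_dBm - P_detector_dBm    # 32.4 dB end-to-end
fanout_loss_dB = 10 * math.log10(n_detectors)     # 12.0 dB intrinsic 1:16 split
residual_dB    = total_loss_dB - fanout_loss_dB   # ~20.4 dB left for grating couplers,
                                                  # laser-to-fibre coupling and excess loss (assumed)
print(f"total {total_loss_dB:.1f} dB, ideal split {fanout_loss_dB:.1f} dB, residual {residual_dB:.1f} dB")
```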

IV. Conclusion

High-speed serial links offer the best usage of the memory bus, given the still large access time of the memory. The reduction of the bus width in itself allows a cost reduction of the backplane. We propose an optical implementation which relies on distributing the clock to both processor and memory in order to control the phase of the signals.

Beyond the feasibility study of such a scheme, which will be explained during the presentation, it should be noted that its implementation depends on conflicting economic interests. On the one hand, processor chip manufacturers are interested in high performance and architectural innovation. On the other hand, memory chip manufacturers are interested chiefly in high volume and high capacity; this situation tends to encourage evolutionary or incremental technological steps, as recently witnessed by the new DRAM architectures proposed by the electronics community [6]. A radical change in memory architecture would best be achieved if processor manufacturers established alliances with high-performance memory chipmakers.

Bibliography

[1] Y. Katayama, "Trends in semiconductor memories", IEEE Micro, Nov/Dec 1997, pp. 10-17.
[2] D. Litaize, A. Mzoughi, P. Sainrat, C. Rochange, "Towards a shared memory massively parallel multiprocessor", 19th Annual International Symposium on Computer Architecture (ISCA), Brisbane, Australia, 19-22 May 1992, and ACM SIGARCH Computer Architecture News, Vol. 20, No. 2, pp. 70-79.
[3] D. Litaize, A. Mzoughi, P. Sainrat, C. Rochange, "The design of the M3S project: a Multiported Shared Multiprocessor", Supercomputing '92, Minneapolis, USA, 16-20 Nov 1992, pp. 326-335.
[4] Hewlett-Packard HDMP-1002 transmitter and HDMP-1004 receiver.
[5] M.P.Y. Desmulliez, M.R. Taghizadeh, P.W. Foulk, "Optical clock distribution for multichip module", Optical Review, pp. 113-114 (1996).
[6] D. Bursky, "Advanced DRAM architectures overcome data bandwidth limits", Electronic Design, November 17, pp. 73-88 (1997).